DPMM_fit

Fits a Dirichlet process mixture model (DPMM) to complete or incomplete datasets containing continuous and categorical variables.

Datasets included in this repository:

Dataset 1: two continuous variables with distinct clusters of data
Dataset 2: two continuous variables + one categorical variable. Two separate clusters (same X1 values); the only distinction between the clusters is the categorical variable
Dataset 3: two categorical variables with distinct clusters of data
Dataset 4: Y variables, three continuous + one categorical predictor variables
Dataset 5: two continuous variables + two categorical variables. Two separate clusters (same X1 values); the only distinction between the clusters is the categorical variables
Dataset 6: two continuous variables with distinct clusters of data. Equivalent to Dataset 1, but with stronger correlation between the two variables

Fitting the model is done with dpmm_fit.R. This file includes the function runModel(dataset, mcmc_iterations, L, standardise):

dataset: a dataframe with any number of continuous and categorical variables with/without missing values
mcmc_iterations: number of iterations of the MCMC algorithm
L: maximum number of components in the model (should be set above the expected number of components)
standardise: logical (default = TRUE), standardises all continuous variables

Returns a list of class dpmm_fit with all the information provided and the standardisation terms.
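As a minimal usage sketch (not taken from the repository), the code below builds a small toy data frame with two continuous variables, one categorical variable and a few missing values, then passes it to runModel. The toy data, the variable names, the use of a factor for the categorical variable, and the values chosen for mcmc_iterations and L are all illustrative assumptions. The call is guarded so that it only runs when dpmm_fit.R is present in the working directory.

```r
# Toy data: two continuous variables and one categorical variable,
# with a few missing values (illustrative only, not a repository dataset).
set.seed(1)
n <- 100
group <- factor(sample(c("a", "b"), n, replace = TRUE))
toy <- data.frame(
  X1 = rnorm(n, mean = ifelse(group == "a", -2, 2)),
  X2 = rnorm(n, mean = ifelse(group == "a", 1, -1)),
  X3 = group
)
toy$X1[sample(n, 5)] <- NA  # introduce missing continuous values

# Only attempt the fit when the repository file is available.
if (file.exists("dpmm_fit.R")) {
  source("dpmm_fit.R")
  # Argument names follow the signature documented above; values are arbitrary.
  fit <- runModel(dataset = toy, mcmc_iterations = 1000, L = 10,
                  standardise = TRUE)
  class(fit)
}
```

The number of iterations and the value of L here are placeholders and would need tuning for a real dataset.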

conditional_RW.R and conditional_RW_block.R provide conditional Gaussian updates for missing continuous variables.
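For orientation, the textbook conditional distribution of a multivariate normal can be sketched as below: given a vector with some entries missing, it returns the mean and covariance of the missing entries conditional on the observed ones. This is a generic illustration and not the code in conditional_RW.R or conditional_RW_block.R; the function name cond_gaussian and the parameter values are hypothetical.

```r
# Conditional distribution of the missing entries of a multivariate normal
# vector given the observed entries (textbook result, not the repository code).
cond_gaussian <- function(x, mu, Sigma) {
  miss <- is.na(x)
  obs <- !miss
  # Regression coefficients of the missing on the observed coordinates
  B <- Sigma[miss, obs, drop = FALSE] %*% solve(Sigma[obs, obs, drop = FALSE])
  list(
    mean = as.vector(mu[miss] + B %*% (x[obs] - mu[obs])),
    cov  = Sigma[miss, miss, drop = FALSE] - B %*% Sigma[obs, miss, drop = FALSE]
  )
}

mu <- c(0, 0)
Sigma <- matrix(c(1, 0.8, 0.8, 1), nrow = 2)
res <- cond_gaussian(c(NA, 1.5), mu, Sigma)
res$mean  # conditional mean of the first variable: 0.8 * 1.5 = 1.2
res$cov   # conditional variance of the first variable: 1 - 0.8^2 = 0.36
```

A missing value could then be imputed with a draw such as rnorm(1, res$mean, sqrt(res$cov)).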

Plots

Traceplots for all DPMM parameters can be plotted through plot_dpmm_fit.R. This includes the function plot.dpmm_fit(x, trace = TRUE, density = TRUE):

x: an object of class dpmm_fit
trace: logical (default = TRUE) plot traceplots
density: logical (default = TRUE) plot density plots

Returns a series of plots for all parameters.

The number of components used in the DPMM can be analysed through plot_alpha.R. This compares several DPMMs by plotting the number of components used across the iterations, the average number of individuals in ranked components, and a traceplot of alpha values. This file includes the function plot.alpha(x):

x: a list of several dpmm_fit objects

Returns a plot divided into three sections.
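The first two summaries can be illustrated with a short, self-contained sketch. It assumes a matrix of component allocations with one row per MCMC iteration and one column per individual; the internal structure of a dpmm_fit object is not documented here, so the matrix z below is simulated and hypothetical.

```r
# Hypothetical allocation matrix: rows are MCMC iterations, columns are
# individuals, entries are component labels in 1..L (simulated here).
set.seed(3)
L <- 5
z <- matrix(
  sample(1:L, 20 * 50, replace = TRUE, prob = c(0.5, 0.3, 0.15, 0.04, 0.01)),
  nrow = 20, ncol = 50
)

# Number of components holding at least one individual, per iteration
n_used <- apply(z, 1, function(row) length(unique(row)))

# Component sizes ranked from largest to smallest within each iteration,
# then averaged over iterations
ranked <- t(apply(z, 1, function(row) {
  sort(tabulate(row, nbins = L), decreasing = TRUE)
}))
avg_ranked <- colMeans(ranked)
```

Here n_used corresponds to a trace of the number of occupied components and avg_ranked to the average size of the largest, second largest, and subsequent components.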

A plot of random samples can be generated with plot_ggpairs.R. This produces either a plot of DPMM samples or, when a new dataset is provided, a comparison against that dataset (with an equal number of samples). This file has the function plot.ggpairs(x, newdata, iterations, nburn):

x: object of class dpmm_fit or list of dpmm_fit objects
newdata: a dataframe with identical structure to data in DPMM fit. newdata can only be used with nburn. Random samples from x will have the same number of draws as newdata.
iterations: vector of iterations for random samples of x. This can only be used with x and will provide a GGally::ggpairs() plot for random samples of x.
nburn: cut-off point for burn-in iterations. Can only be used when newdata is supplied, and cannot be supplied alongside iterations.

Returns a generalised pairs plot.

Sampling

Predictions from the DPMM are made with the files predict_dpmm_fit.R and posterior_dpmm.R. The file predict_dpmm_fit.R has the function predict.dpmm_fit(object, newdata, samples):

object: object of class dpmm_fit or ggpairs.fit
newdata: a dataframe with identical structure to data in the DPMM fit. It needs to have missing values for prediction. When not provided, random samples are taken from the DPMM.
samples: vector of iterations to be used for prediction.

This function calls the function posterior_dpmm to make predictions and returns either a list of posterior predictive distributions for all missing values or a list of random samples from the DPMM.

The file posterior_dpmm.R contains the function posterior_dpmm(patient, samples, seed, cont_vars, cat_vars):

patient: a dataframe with identical structure to data in DPMM fit. Requires missing values to make predictions.
samples: a dataframe with MCMC samples for all parameters of the DPMM.
seed: sets seed for predictions (default = NULL)
cont_vars: character vector of names for continuous variables (should be NULL if no continuous variables in the model)
cat_vars: character vector of names for categorical variables (should be NULL if no categorical variables in the model)

This function makes conditional predictions from a DPMM for a mixture of continuous and categorical variables.
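The idea behind a conditional prediction from a mixture can be sketched for a single MCMC draw and two continuous variables. Component weights are first reweighted by how well each component explains the observed value, and the missing value is then drawn from the conditional Gaussian of a sampled component. This is a simplified illustration under assumed parameter values, not the code in posterior_dpmm.R; the function predict_x1 is hypothetical and categorical variables are not handled.

```r
# One posterior draw of a two-component bivariate Gaussian mixture
# (made-up parameter values for illustration).
w  <- c(0.5, 0.5)
mu <- list(c(-2, -2), c(2, 2))
Sigma <- list(diag(2), diag(2))

# Predict the missing X1 for an individual with X2 observed.
predict_x1 <- function(x2, w, mu, Sigma) {
  # Reweight components by the likelihood of the observed X2
  lik <- mapply(function(m, S) dnorm(x2, m[2], sqrt(S[2, 2])), mu, Sigma)
  post_w <- w * lik / sum(w * lik)
  k <- sample(seq_along(w), 1, prob = post_w)  # pick a component
  # Conditional Gaussian for X1 given X2 within component k
  m <- mu[[k]]
  S <- Sigma[[k]]
  cmean <- m[1] + S[1, 2] / S[2, 2] * (x2 - m[2])
  cvar  <- S[1, 1] - S[1, 2]^2 / S[2, 2]
  list(weights = post_w, draw = rnorm(1, cmean, sqrt(cvar)))
}

set.seed(42)
out <- predict_x1(x2 = 2, w, mu, Sigma)
out$weights  # the second component dominates because X2 = 2 sits at its mean
```

Repeating this over many MCMC iterations would give a set of draws approximating the posterior predictive distribution of the missing value.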

Demo

The file demo.R includes code demonstrating several capabilities of the DPMM. This file explores all 6 datasets and all post-processing functions.

Repository: tjmckinley/DPMM
