forked from PM-Cardoso/DPMM

Clustering data using Dirichlet process mixture models.


tjmckinley/DPMM

 
 

DPMM_fit

Fitting a DPMM to complete/incomplete datasets with continuous and categorical variables.


Datasets included in this repository:

  • Dataset 1: two continuous variables with distinct clusters of data
  • Dataset 2: two continuous variables + one categorical variable. Two separate clusters with the same X1 values; the only distinction between the clusters is the categorical variable
  • Dataset 3: two categorical variables with distinct clusters of data
  • Dataset 4: Y variables, three continuous + one categorical predictor variables
  • Dataset 5: two continuous variables + two categorical variables. Two separate clusters with the same X1 values; the only distinction between the clusters is the categorical variables
  • Dataset 6: two continuous variables with distinct clusters of data. Equivalent to Dataset 1, but with stronger correlation between the two variables

Fitting the model is done with dpmm_fit.R. This file includes the function runModel(dataset, mcmc_iterations, L, standardise):

  • dataset: a dataframe with any number of continuous and categorical variables with/without missing values
  • mcmc_iterations: number of iterations of the MCMC algorithm
  • L: maximum number of components in the model (should be above the optimal number of components)
  • standardise: logical (default = TRUE); standardises all continuous variables

Returns a list of class dpmm_fit with all the information provided and the standardisation terms.
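As a rough illustration of the inputs described above, the sketch below builds a small incomplete dataset and shows what standardising the continuous variables amounts to. The helper standardise_continuous() and the toy data are hypothetical and are not taken from dpmm_fit.R; the argument values passed to runModel() are arbitrary, and the call is only attempted if dpmm_fit.R is present in the working directory.

```r
# Toy data: two continuous variables, one categorical, some values missing.
set.seed(1)
toy <- data.frame(
  X1 = c(rnorm(20, mean = 0), rnorm(20, mean = 5)),
  X2 = c(rnorm(20, mean = 0), rnorm(20, mean = 5)),
  G  = factor(rep(c("a", "b"), each = 20))
)
toy$X1[c(3, 27)] <- NA  # incomplete rows are allowed according to the README

# Hypothetical helper: centre and scale numeric columns, keeping the terms
# so that values can be transformed back to the original scale.
standardise_continuous <- function(df) {
  is_cont <- vapply(df, is.numeric, logical(1))
  means <- vapply(df[is_cont], mean, numeric(1), na.rm = TRUE)
  sds   <- vapply(df[is_cont], sd,   numeric(1), na.rm = TRUE)
  df[is_cont] <- Map(function(x, m, s) (x - m) / s, df[is_cont], means, sds)
  list(data = df, mean = means, sd = sds)
}

std <- standardise_continuous(toy)

# Back-transform one column to check the stored terms.
x2_back <- std$data$X2 * std$sd[["X2"]] + std$mean[["X2"]]

# Only attempt a real fit if the repository file is available.
# mcmc_iterations and L are arbitrary illustrative values.
if (file.exists("dpmm_fit.R")) {
  source("dpmm_fit.R")
  fit <- runModel(toy, mcmc_iterations = 2500, L = 10, standardise = TRUE)
}
```

With standardise = TRUE the function is described as doing the scaling itself, so the raw data frame is what gets passed to runModel() here.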

conditional_RW.R and conditional_RW_block.R provide conditional Gaussian updates for missing continuous variables.
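For readers unfamiliar with the idea, the following sketch shows the textbook conditional distribution of the missing entries of a multivariate Gaussian given the observed ones. It is not the code in conditional_RW.R or conditional_RW_block.R, whose details may differ; the function name conditional_gaussian() and the example values are assumptions made for illustration.

```r
# Conditional distribution of the missing block of a multivariate Gaussian
# given the observed block (standard formula, not the repository's code).
conditional_gaussian <- function(x, mu, Sigma) {
  miss <- is.na(x)
  obs  <- !miss
  # Regression coefficients of missing on observed: Sigma_mo %*% solve(Sigma_oo)
  B <- Sigma[miss, obs, drop = FALSE] %*% solve(Sigma[obs, obs, drop = FALSE])
  list(
    mean = as.vector(mu[miss] + B %*% (x[obs] - mu[obs])),
    cov  = Sigma[miss, miss, drop = FALSE] - B %*% Sigma[obs, miss, drop = FALSE]
  )
}

mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)

# First variable missing, second observed at 2.
cond <- conditional_gaussian(c(NA, 2), mu, Sigma)
cond$mean  # 1
cond$cov   # 0.75

# One imputation draw for the missing value.
set.seed(42)
draw <- rnorm(1, mean = cond$mean, sd = sqrt(cond$cov[1, 1]))
```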


Plots

Traceplots for all DPMM parameters can be plotted through plot_dpmm_fit.R. This includes the function plot.dpmm_fit(x, trace = TRUE, density = TRUE):

  • x: an object of class dpmm_fit
  • trace: logical (default = TRUE) plot traceplots
  • density: logical (default = TRUE) plot density plots

Returns a series of plots for all parameters.

The number of components used in the DPMM can be analysed through plot_alpha.R. This compares several DPMMs by plotting the number of components used across the iterations, the average number of individuals in ranked components, and a traceplot of alpha values. This file includes the function plot.alpha(x):

  • x: a list of several dpmm_fit objects

Returns a plot divided into three sections.
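The first two summaries can be sketched from component allocations alone. The example below assumes a hypothetical allocation matrix z with one row per MCMC iteration and one column per individual; how allocations are actually stored inside a dpmm_fit object is not stated here, so this is not how plot.alpha() is implemented.

```r
# Hypothetical allocation matrix: rows = MCMC iterations, columns = individuals.
z <- rbind(
  c(1, 1, 2, 2, 3),
  c(1, 1, 1, 2, 2),
  c(4, 4, 4, 4, 1)
)
L <- 4  # maximum number of components (assumed)

# Number of occupied components at each iteration.
n_used <- apply(z, 1, function(row) length(unique(row)))

# Component sizes at each iteration, ranked from largest to smallest;
# tabulate() pads unused components with zeros.
ranked_sizes <- t(apply(z, 1, function(row) {
  sort(tabulate(row, nbins = L), decreasing = TRUE)
}))

# Average number of individuals in the largest, second largest, ... component.
avg_ranked <- colMeans(ranked_sizes)
```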

A plot of random samples can be generated with plot_ggpairs.R. This produces either a plot of DPMM samples or, when a new dataset is provided, a comparison against that dataset (with an equal number of samples). This file has the function plot.ggpairs(x, newdata, iterations, nburn):

  • x: object of class dpmm_fit or list of dpmm_fit objects
  • newdata: a dataframe with identical structure to data in DPMM fit. newdata can only be used with nburn. Random samples from x will have the same number of draws as newdata.
  • iterations: vector of iterations for random samples of x. This can only be used with x and will provide a GGally::ggpairs() plot for random samples of x.
  • nburn: cut-off point for burn-in iterations. Can only be used when newdata is supplied, and cannot be supplied alongside iterations

Returns a generalised pairs plot.

Sampling

Predictions from the DPMM are made with the files predict_dpmm_fit.R and posterior_dpmm.R. The file predict_dpmm_fit.R has the function predict.dpmm_fit(object, newdata, samples):

  • object: object of class dpmm_fit or ggpairs.fit
  • newdata: a dataframe with identical structure to data in the DPMM fit. It needs to have missing values for prediction. When not provided, random samples are taken from the DPMM.
  • samples: vector of iterations to be used for prediction.

This function calls the function posterior_dpmm to make predictions and returns a list of posterior predictive distributions for all missing values or a list of random samples from the DPMM.

The file posterior_dpmm.R contains the function posterior_dpmm(patient, samples, seed, cont_vars, cat_vars):

  • patient: a dataframe with identical structure to data in DPMM fit. Requires missing values to make predictions.
  • samples: a dataframe with mcmc samples for all parameters of DPMM.
  • seed: sets seed for predictions (default = NULL)
  • cont_vars: character vector of names for continuous variables (should be NULL if no continuous variables in the model)
  • cat_vars: character vector of names for categorical variables (should be NULL if no categorical variables in the model)

This function makes conditional predictions from a DPMM for a mixture of continuous and categorical variables.
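A minimal sketch of a conditional prediction for a single posterior sample is given below, restricted to two continuous variables for brevity. The mixture weights, means and covariances are made-up values standing in for one row of MCMC output, and the steps shown (reweight components by the observed value, then draw the missing value from the chosen component's conditional Gaussian) are the generic approach rather than a transcription of posterior_dpmm.R.

```r
# One hypothetical posterior sample of a two-component bivariate Gaussian mixture.
w     <- c(0.6, 0.4)
mu    <- list(c(0, 0), c(5, 5))
Sigma <- list(diag(2), matrix(c(1, 0.8, 0.8, 1), nrow = 2))

x1_obs <- 4.5  # X1 is observed; X2 is missing and to be predicted

# Component responsibilities given the observed value:
# proportional to w_k * N(x1_obs | mu_k[1], Sigma_k[1, 1]).
lik  <- mapply(function(m, S) dnorm(x1_obs, mean = m[1], sd = sqrt(S[1, 1])),
               mu, Sigma)
resp <- w * lik / sum(w * lik)

# Draw a component, then draw X2 from that component's conditional Gaussian.
set.seed(123)
k <- sample(seq_along(w), size = 1, prob = resp)
cond_mean <- mu[[k]][2] +
  Sigma[[k]][2, 1] / Sigma[[k]][1, 1] * (x1_obs - mu[[k]][1])
cond_var  <- Sigma[[k]][2, 2] - Sigma[[k]][2, 1]^2 / Sigma[[k]][1, 1]
x2_draw   <- rnorm(1, mean = cond_mean, sd = sqrt(cond_var))
```

Repeating the draw over many posterior samples would give an approximation to the posterior predictive distribution of the missing value.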

Demo

The file demo.R includes code demonstrating several capabilities of the DPMM. It explores all 6 datasets and all post-processing functions.
