Title: | Variables Clustering |
---|---|
Description: | Performs clustering of quantitative variables, assuming that clusters lie in low-dimensional subspaces. Segmentation of variables, number of clusters and their dimensions are selected based on BIC. Candidate models are identified based on many runs of K-means algorithm with different random initializations of cluster centers. |
Authors: | Piotr Sobczyk, Stanislaw Wilczynski, Julie Josse, Malgorzata Bogdan |
Maintainer: | Piotr Sobczyk <[email protected]> |
License: | GPL-3 |
Version: | 0.10.0 |
Built: | 2025-02-18 05:28:03 UTC |
Source: | https://github.com/psobczyk/varclust |
Generates data for simulation with a low-rank subspace structure: variables are clustered and each cluster has a low-rank representation. Factors than span subspaces are not shared between clusters.
data.simulation( n = 100, SNR = 1, K = 10, numb.vars = 30, max.dim = 2, min.dim = 1, equal.dims = TRUE )
data.simulation( n = 100, SNR = 1, K = 10, numb.vars = 30, max.dim = 2, min.dim = 1, equal.dims = TRUE )
n |
An integer, number of individuals. |
SNR |
A numeric, signal to noise ratio measured as variance of the variable, element of a subspace, to the variance of noise. |
K |
An integer, number of subspaces. |
numb.vars |
An integer, number of variables in each subspace. |
max.dim |
An integer, if equal.dims is TRUE then max.dim is dimension of each subspace. If equal.dims is FALSE then subspaces dimensions are drawn from uniform distribution on [min.dim,max.dim]. |
min.dim |
An integer, minimal dimension of subspace . |
equal.dims |
A boolean, if TRUE (value set by default) all clusters are of the same dimension. |
A list consisting of:
X |
matrix, generated data |
signals |
matrix, data without noise |
dims |
vector, dimensions of subspaces |
factors |
matrix, columns of which span subspaces |
s |
vector, true partiton of variables |
sim.data <- data.simulation() sim.data2 <- data.simulation( n = 30, SNR = 2, K = 5, numb.vars = 20, max.dim = 3, equal.dims = FALSE )
sim.data <- data.simulation() sim.data2 <- data.simulation( n = 30, SNR = 2, K = 5, numb.vars = 20, max.dim = 3, equal.dims = FALSE )
Generating data for simulation with a low-rank subspace structure: variables are clustered and each cluster has a low-rank representation. Factors that span subspaces are shared between clusters.
data.simulation.factors( n = 100, SNR = 1, K = 10, numb.vars = 30, numb.factors = 10, min.dim = 1, max.dim = 2, equal.dims = TRUE, separation.parameter = 0.1 )
data.simulation.factors( n = 100, SNR = 1, K = 10, numb.vars = 30, numb.factors = 10, min.dim = 1, max.dim = 2, equal.dims = TRUE, separation.parameter = 0.1 )
n |
An integer, number of individuals. |
SNR |
A numeric, signal to noise ratio measured as variance of the variable, element of a subspace, to the variance of noise. |
K |
An integer, number of subspaces. |
numb.vars |
An integer, number of variables in each subspace. |
numb.factors |
An integer, number of factors from which subspaces basis will be drawn. |
min.dim |
An integer, minimal dimension of subspace . |
max.dim |
An integer, if equal.dims is TRUE then max.dim is dimension of each subspace. If equal.dims is FALSE then subspaces dimensions are drawn from uniform distribution on [min.dim,max.dim]. |
equal.dims |
A boolean, if TRUE (value set by default) all clusters are of the same dimension. |
separation.parameter |
a numeric, coefficients of variables in each subspace basis are drawn from range [separation.parameter,1] |
A list consisting of:
X |
matrix, generated data |
signals |
matrix, data without noise |
factors |
matrix, columns of which span subspaces |
indices |
list of vectors, indices of factors that span subspaces |
dims |
vector, dimensions of subspaces |
s |
vector, true partiton of variables |
sim.data <- data.simulation.factors() sim.data2 <- data.simulation.factors( n = 30, SNR = 2, K = 5, numb.vars = 20, numb.factors = 10, max.dim = 3, equal.dims = FALSE, separation.parameter = 0.2 )
sim.data <- data.simulation.factors() sim.data2 <- data.simulation.factors( n = 30, SNR = 2, K = 5, numb.vars = 20, numb.factors = 10, max.dim = 3, equal.dims = FALSE, separation.parameter = 0.2 )
Integartion and acontamination are measures of the quality of a clustering
with a reference to a true partition. Let be the
data set,
be a partition into clusters
(true
partition) and
be a partition into clusters
.
Then for cluster
integration is eqaul to:
The for which the value is
maximized is called the integrating cluster of
. Then the
integration for the whole clustering equals is
.The acontamination is defined by:
where is the integrating cluster for
. Then the
acontamination for the whole dataset is
integration(group, true_group)
integration(group, true_group)
group |
A vector, first partition. |
true_group |
A vector, second (reference) partition. |
An array containing values of integration and acontamination.
M. Sołtys. Metody analizy skupień. Master’s thesis, Wrocław University of Technology, 2010
sim.data <- data.simulation(n = 20, SNR = 1, K = 2, numb.vars = 50, max.dim = 2) true_segmentation <- rep(1:2, each = 50) mlcc.fit <- mlcc.reps(sim.data$X, numb.clusters = 2, max.dim = 2, numb.cores = 1) integration(mlcc.fit$segmentation, true_segmentation)
sim.data <- data.simulation(n = 20, SNR = 1, K = 2, numb.vars = 50, max.dim = 2) true_segmentation <- rep(1:2, each = 50) mlcc.fit <- mlcc.reps(sim.data$X, numb.clusters = 2, max.dim = 2, numb.cores = 1) integration(mlcc.fit$segmentation, true_segmentation)
Missclasification is a commonly used performance measure in subspace clustering. It allows to compare two partitions with the same number of clusters.
misclassification(group, true_group, M, K)
misclassification(group, true_group, M, K)
group |
A vector, first partition. |
true_group |
A vector, second (reference) partition. |
M |
An integer, maximal number of elements in one class. |
K |
An integer, number of classes. |
As getting exact value of misclassification requires checking all permutations and is therefore intrackable even for modest number of clusters, a heuristic approach is proposed. It is assumed that there are K classes of maximum M elements. Additional requirement is that classes labels are from range [1, K].
Misclassification rate.
R. Vidal. Subspace clustering. Signal Processing Magazine, IEEE, 28(2):52-68,2011
sim.data <- data.simulation(n = 100, SNR = 1, K = 5, numb.vars = 30, max.dim = 2) mlcc.fit <- mlcc.reps(sim.data$X, numb.clusters = 5, numb.runs = 20, max.dim = 2, numb.cores = 1) misclassification(mlcc.fit$segmentation, sim.data$s, 30, 5) # one can use this function not only for clusters partition1 <- sample(10, 300, replace = TRUE) partition2 <- sample(10, 300, replace = TRUE) misclassification(partition1, partition1, max(table(partition1)), 10) misclassification(partition1, partition2, max(table(partition2)), 10)
sim.data <- data.simulation(n = 100, SNR = 1, K = 5, numb.vars = 30, max.dim = 2) mlcc.fit <- mlcc.reps(sim.data$X, numb.clusters = 5, numb.runs = 20, max.dim = 2, numb.cores = 1) misclassification(mlcc.fit$segmentation, sim.data$s, 30, 5) # one can use this function not only for clusters partition1 <- sample(10, 300, replace = TRUE) partition2 <- sample(10, 300, replace = TRUE) misclassification(partition1, partition1, max(table(partition1)), 10) misclassification(partition1, partition2, max(table(partition2)), 10)
This function is an implementation of Multiple Latent Components Clustering
(MLCC) algorithm which clusteres quantitative variables into a number, chosen
using mBIC, of groups. For each considered number of clusters in
numb.clusters mlcc.reps
function is called. It invokes
K-means based algorithm (mlcc.kmeans
) finding local minimum of
mBIC, which is run a given number of times (numb.runs) with different
initializations. The best partition is choosen with mBIC (see
mlcc.reps
function).
mlcc.bic( X, numb.clusters = 1:10, numb.runs = 30, stop.criterion = 1, max.iter = 30, max.dim = 4, scale = TRUE, numb.cores = NULL, greedy = TRUE, estimate.dimensions = TRUE, verbose = FALSE, flat.prior = FALSE, show.warnings = FALSE )
mlcc.bic( X, numb.clusters = 1:10, numb.runs = 30, stop.criterion = 1, max.iter = 30, max.dim = 4, scale = TRUE, numb.cores = NULL, greedy = TRUE, estimate.dimensions = TRUE, verbose = FALSE, flat.prior = FALSE, show.warnings = FALSE )
X |
A data frame or a matrix with only continuous variables. |
numb.clusters |
A vector, numbers of clusters to be checked. |
numb.runs |
An integer, number of runs (initializations) of
|
stop.criterion |
An integer, if an iteration of
|
max.iter |
An integer, maximum number of iterations of the loop in
|
max.dim |
An integer, if estimate.dimensions is FALSE then max.dim is dimension of each subspace. If estimate.dimensions is TRUE then subspaces dimensions are estimated from the range [1, max.dim]. |
scale |
A boolean, if TRUE (value set by default) then variables in dataset are scaled to zero mean and unit variance. |
numb.cores |
An integer, number of cores to be used, by default all cores are used. |
greedy |
A boolean, if TRUE (value set by default) the clusters are estimated in a greedy way - first local minimum of mBIC is chosen. |
estimate.dimensions |
A boolean, if TRUE (value set by default) subspaces dimensions are estimated. |
verbose |
A boolean, if TRUE plot with mBIC values for different numbers of clusters is produced and values of mBIC, computed for every number of clusters and subspaces dimensions, are printed (value set by default is FALSE). |
flat.prior |
A boolean, if TRUE then, instead of an informative prior that takes into account number of models for a given number of clusters, flat prior is used. |
show.warnings |
A boolean, if set to TRUE all warnings are displayed, default value is FALSE. |
An object of class mlcc.fit consisting of
segmentation |
a vector containing the partition of the variables |
BIC |
numeric, value of mBIC |
subspacesDimensions |
a list containing dimensions of the subspaces |
nClusters |
an integer, estimated number of clusters |
factors |
a list of matrices, basis for each subspace |
all.fit |
a list of segmentation, mBIC, subspaces dimension for all numbers of clusters considered for an estimated subspace dimensions |
all.fit.dims |
a list of lists of segmentation, mBIC, subspaces dimension for all numbers of clusters and subspaces dimensions considered |
sim.data <- data.simulation(n = 50, SNR = 1, K = 3, numb.vars = 50, max.dim = 3) mlcc.res <- mlcc.bic(sim.data$X, numb.clusters = 1:5, numb.runs = 20, numb.cores = 1, verbose = TRUE) show.clusters(sim.data$X, mlcc.res$segmentation)
sim.data <- data.simulation(n = 50, SNR = 1, K = 3, numb.vars = 50, max.dim = 3) mlcc.res <- mlcc.bic(sim.data$X, numb.clusters = 1:5, numb.runs = 20, numb.cores = 1, verbose = TRUE) show.clusters(sim.data$X, mlcc.res$segmentation)
Performs k-means based subspace clustering. Center of each cluster is some number of principal components. This number can be fixed or estimated by PESEL. Similarity measure between variable and a cluster is calculated using BIC.
mlcc.kmeans( X, number.clusters = 2, stop.criterion = 1, max.iter = 30, max.subspace.dim = 4, initial.segmentation = NULL, estimate.dimensions = TRUE, show.warnings = FALSE )
mlcc.kmeans( X, number.clusters = 2, stop.criterion = 1, max.iter = 30, max.subspace.dim = 4, initial.segmentation = NULL, estimate.dimensions = TRUE, show.warnings = FALSE )
X |
A matrix with only continuous variables. |
number.clusters |
An integer, number of clusters to be used. |
stop.criterion |
An integer indicating how many changes in partitions triggers stopping the algorithm. |
max.iter |
An integer, maximum number of iterations of k-means loop. |
max.subspace.dim |
An integer, maximum dimension of subspaces. |
initial.segmentation |
A vector, initial segmentation of variables to clusters. |
estimate.dimensions |
A boolean, if TRUE (value set by default) subspaces dimensions are estimated. |
show.warnings |
A boolean, if set to TRUE all warnings are displayed, default value is FALSE. |
A list consisting of:
segmentation |
a vector containing the partition of the variables |
pcas |
a list of matrices, basis vectors for each cluster (subspace) |
Bayesian dimensionality reduction with PCA using penalized semi-integrated likelihood, Piotr Sobczyk, Malgorzata Bogdan, Julie Josse
sim.data <- data.simulation(n = 50, SNR = 1, K = 5, numb.vars = 50, max.dim = 3) mlcc.res <- mlcc.kmeans(sim.data$X, number.clusters = 5, max.iter = 20, max.subspace.dim = 3) show.clusters(sim.data$X, mlcc.res$segmentation)
sim.data <- data.simulation(n = 50, SNR = 1, K = 5, numb.vars = 50, max.dim = 3) mlcc.res <- mlcc.kmeans(sim.data$X, number.clusters = 5, max.iter = 20, max.subspace.dim = 3) show.clusters(sim.data$X, mlcc.res$segmentation)
For a fixed number of cluster function returns the best partition and basis for each subspace.
mlcc.reps( X, numb.clusters = 2, numb.runs = 30, stop.criterion = 1, max.iter = 30, initial.segmentations = NULL, max.dim = 4, scale = TRUE, numb.cores = NULL, estimate.dimensions = TRUE, flat.prior = FALSE, show.warnings = FALSE )
mlcc.reps( X, numb.clusters = 2, numb.runs = 30, stop.criterion = 1, max.iter = 30, initial.segmentations = NULL, max.dim = 4, scale = TRUE, numb.cores = NULL, estimate.dimensions = TRUE, flat.prior = FALSE, show.warnings = FALSE )
X |
A data frame or a matrix with only continuous variables. |
numb.clusters |
An integer, number of cluster. |
numb.runs |
An integer, number of runs of |
stop.criterion |
An integer, if an iteration of
|
max.iter |
max.iter An integer, maximum number of iterations of the loop
in |
initial.segmentations |
A list of vectors, segmentations that user wants
to be used as an initial segmentation in |
max.dim |
An integer, maximal dimension of subspaces. |
scale |
A boolean, if TRUE (value set by default) then variables in dataset are scaled to zero mean and unit variance. |
numb.cores |
An integer, number of cores to be used, by default all cores are used. |
estimate.dimensions |
A boolean, if TRUE (value set by default) subspaces dimensions are estimated. |
flat.prior |
A boolean, if TRUE then, instead of a prior that takes into account number of models for a given number of clusters, flat prior is used. |
show.warnings |
A boolean, if set to TRUE all warnings are displayed, default value is FALSE. |
In more detail, an algorithm mlcc.kmeans
is run a
numb.runs of times with random or custom initializations. The best
partition is selected according to the BIC.
A list consisting of
segmentation |
a vector containing the partition of the variables |
BIC |
a numeric, value of the mBIC |
basis |
a list of matrices, the factors for each of the subspaces |
sim.data <- data.simulation(n = 50, SNR = 1, K = 5, numb.vars = 50, max.dim = 3) mlcc.res <- mlcc.reps(sim.data$X, numb.clusters = 5, numb.runs = 20, max.dim = 4, numb.cores = 1) show.clusters(sim.data$X, mlcc.res$segmentation)
sim.data <- data.simulation(n = 50, SNR = 1, K = 5, numb.vars = 50, max.dim = 3) mlcc.res <- mlcc.reps(sim.data$X, numb.clusters = 5, numb.runs = 20, max.dim = 4, numb.cores = 1) show.clusters(sim.data$X, mlcc.res$segmentation)
Print clusters obtained from MLCC
show.clusters(data, segmentation)
show.clusters(data, segmentation)
data |
The original data set. |
segmentation |
A vector, segmentation of variables into clusters. |
Package varclust performs clustering of variables, according to a probabilistic model, which assumes that each cluster lies in a low dimensional subspace. Segmentation of variables, number of clusters and their dimensions are selected based on the appropriate implementation of the Bayesian Information Criterion.
The best candidate models are identified by the specific implementation of K-means algorithm, in which cluster centers are represented by some number of orthogonal factors(principal components of the variables within a cluster) and similarity between a given variable and a cluster center depends on residuals from a linear model fit. Based on the Bayesian Information Criterion (BIC), sums of squares of residuals are appropriately scaled, which allows to avoid an over-excessive attraction by clusters with larger dimensions. To reduce the chance that the local minimum of modified BIC (mBIC) is obtained instead of the global one, for every fixed number of clusters in a given range K-means algorithm is run large number of times, with different random initializations of cluster centers.
The main function of package varclust is mlcc.bic
which
allows clustering variables in a data with unknown number of clusters.
Variable partition is computed with k-means based algorithm. Number of
clusters and their dimensions are estimated using mBIC and PESEL
respectively. If the number of clusters is known one might use function
mlcc.reps
, which takes number of clusters as a parameter. For
mlcc.reps
one might specify as well some initial segmentation
for k-means algorithm. This can be useful if user has some a priori knowledge
about clustering.
We provide also two functions to simulate datasets with described structure.
The function data.simulation
generates the data so that the
subspaces are indepentend and data.simulation.factors
generates
the data where some factores are shared between the subspaces.
We also provide function measures of quality of clustering.
misclassification
computes misclassification rate between two
partitions. This performance measure is extensively used in image
segmentation. The other measure is implemented as integration
function.
Version: 0.9.5
Piotr Sobczyk, Stanislaw Wilczynski, Julie Josse, Malgorzata Bogdan
Maintainer: Piotr Sobczyk [email protected]
sim.data <- data.simulation(n = 50, SNR = 1, K = 3, numb.vars = 50, max.dim = 3) mlcc.bic(sim.data$X, numb.clusters = 1:5, numb.runs = 20, numb.cores = 1, verbose = TRUE) mlcc.reps(sim.data$X, numb.clusters = 3, numb.runs = 20, numb.cores = 1)
sim.data <- data.simulation(n = 50, SNR = 1, K = 3, numb.vars = 50, max.dim = 3) mlcc.bic(sim.data$X, numb.clusters = 1:5, numb.runs = 20, numb.cores = 1, verbose = TRUE) mlcc.reps(sim.data$X, numb.clusters = 3, numb.runs = 20, numb.cores = 1)