The goal of this tutorial is to introduce the most popular and simple clustering algorithms, namely k-means, EM GMM, hierarchical clustering, and DBSCAN. We will illustrate their strengths and weaknesses on a few artificial datasets. We will work with low-dimensional data to keep the task easy to understand and visualize.
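
The code below relies on functions from several add-on packages: mvrnorm from MASS, Mclust, mclustBIC and adjustedRandIndex from mclust, clusGap from cluster, dbscan and kNNdistplot from dbscan, NMI from NMI, and kable from knitr. A minimal setup sketch, assuming these packages are installed (the set.seed call is our own addition to make the random draws reproducible):

library(MASS)    # mvrnorm
library(mclust)  # Mclust, mclustBIC, adjustedRandIndex
library(cluster) # clusGap
library(dbscan)  # dbscan, kNNdistplot
library(NMI)     # NMI
library(knitr)   # kable
set.seed(1)      # make the random draws reproducible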

Mixture of bivariate normal distributions

We will start with a mixture of 4 bivariate normal distributions sharing a common identity covariance matrix. The data can be generated as follows:

Sigma <- matrix(c(1,0,0,1),2,2) # identity covariance, i.e. spherical clusters
d<-mvrnorm(n = 50, mu=c(1,1),Sigma) # 50 points around each of the four centers
d<-rbind(d,mvrnorm(n = 50, mu=c(1,5),Sigma))
d<-rbind(d,mvrnorm(n = 50, mu=c(5,1),Sigma))
d<-rbind(d,mvrnorm(n = 50, mu=c(5,5),Sigma))
d<-cbind(d,rep(c(1,2,3,4),each=50)) # append the true class label
plot(d[,1],d[,2],col=d[,3],xlab="x",ylab="y")

d<-as.data.frame(d)
colnames(d)<-c("x","y","class")

Let us start by considering which methods should work best for this dataset. Obviously, EM GMM should be our first choice since the data come from a Gaussian mixture; it is the right model with the right assumptions. Similarly, k-means will also work well because we deal with isotropic (identity) covariance matrices, which implies that the clusters are spherical.

Now we will implement, test and evaluate the individual clustering methods.

K-means clustering

We will use the algorithm proposed by Hartigan and Wong, as implemented in the kmeans function of the base stats package. We will run the algorithm multiple times with different random cluster seeds and take the best partition out of these random runs. The elbow method is used to manually guess the optimal \(k\) (in our case it should obviously be 4), while the gap statistic (clusGap from the cluster package) sets the number of clusters automatically.

# plot the homogeneity plot up to k.max
# the best k may be guessed from the elbow in the homogeneity curve
elbowMethod<-function(data,k.max){
  wss <- sapply(1:k.max, 
          function(k){kmeans(data, k, nstart=50,iter.max = 15)$tot.withinss})
  plot(1:k.max, wss,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")
}

elbowMethod(d[,c(1,2)],10) # manually estimate the optimal number of clusters with the elbow method

gapStat<-clusGap(d[,c(1,2)], FUNcluster = kmeans, K.max = 10) # find the optimal number of clusters with the gap statistic
km.out<-kmeans(d[,c(1,2)],centers=which.max(gapStat$Tab[,3]),nstart=10) # column 3 of gapStat$Tab holds the gap values; run k-means with the k that maximizes it
km.out$cluster # see its outcome
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2
##  [38] 2 2 2 2 2 2 2 1 2 2 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 4 4 3 3 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 1
## [186] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
table(d[,3],km.out$cluster) # confusion matrix clusters vs true classes
##    
##      1  2  3  4
##   1  1 46  3  0
##   2 48  0  0  2
##   3  1  0 49  0
##   4  3  0  2 45
plot(d[,1],d[,2],col=d[,3],xlab="x",ylab="y")
points(x=d[,1],y=d[,2],col=km.out$cluster,pch=4)

plot(x=d[,1],y=d[,2],col=km.out$cluster,pch=4,xlab="x",ylab="y") # mismatch between clusters and classes
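
As a side note, taking the plain maximum of the gap column is the simplest selection rule; the cluster package also offers maxSE, which implements several standard-error based rules. A hedged sketch reusing the gapStat object computed above (k.opt is our own variable name):

# a more conservative choice of k from the same gap statistic object
k.opt <- maxSE(f = gapStat$Tab[, "gap"], SE.f = gapStat$Tab[, "SE.sim"],
               method = "firstSEmax")
k.opt # for this well-separated dataset it will typically also suggest 4 clusters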

EM GMM

Here, model-based clustering based on parameterized finite Gaussian mixture models will be used. Models are estimated by the EM algorithm initialized by hierarchical model-based agglomerative clustering. The optimal model is selected according to the Bayesian information criterion (BIC), which trades off the model log-likelihood against the number of model parameters; note that mclust reports BIC as \(2\log\hat{L} - p\log n\) (\(\hat{L}\) the maximized likelihood, \(p\) the number of free parameters, \(n\) the number of observations), so models with higher (less negative) BIC are preferred. In our case, we distinguish between spherical equal-volume clusters (EII, the right case for our dataset), spherical unequal-volume clusters (VII), ellipsoidal clusters with equal volume, shape, and orientation (EEE), …, or ellipsoidal clusters with varying volume, shape, and orientation (VVV).

BIC <- mclustBIC(d[,c(1,2)]) # find the best model with BIC
BIC # check that we truly found EII with k=4 which corresponds to the dataset
## Bayesian Information Criterion (BIC): 
##         EII       VII       EEI       VEI       EVI       VVI       EEE
## 1 -1786.616 -1786.616 -1791.838 -1791.838 -1791.838 -1791.838 -1796.632
## 2 -1801.789 -1775.858 -1759.958 -1781.148 -1748.772 -1780.806 -1764.293
## 3 -1791.462 -1753.862 -1761.741 -1740.120 -1744.490 -1747.030 -1766.744
## 4 -1723.512 -1735.794 -1728.587 -1740.402 -1742.045 -1752.005 -1730.933
## 5 -1736.954 -1754.010 -1742.089 -1759.096 -1755.915 -1773.807 -1745.349
## 6 -1750.150 -1772.218 -1754.954 -1776.242 -1772.708 -1792.198 -1759.549
## 7 -1760.429 -1784.115 -1760.220 -1784.651 -1785.772 -1813.930 -1764.432
## 8 -1767.984 -1795.419 -1770.766 -1799.662 -1802.709 -1833.662 -1775.532
## 9 -1783.938 -1810.004 -1786.711 -1813.274 -1819.541 -1846.405 -1791.484
##         VEE       EVE       VVE       EEV       VEV       EVV       VVV
## 1 -1796.632 -1796.632 -1796.632 -1796.632 -1796.632 -1796.632 -1796.632
## 2 -1781.753 -1769.590 -1780.420 -1756.566 -1778.686 -1758.661 -1783.032
## 3 -1745.394 -1749.776 -1752.447 -1758.244 -1752.440 -1759.733 -1759.870
## 4 -1742.827 -1744.965 -1755.511 -1745.659 -1756.061 -1759.917 -1770.223
## 5 -1762.638 -1758.955 -1772.797 -1758.309 -1776.476 -1776.985 -1793.115
## 6 -1774.732 -1775.722 -1793.638 -1776.727 -1799.059 -1794.616 -1816.531
## 7 -1788.062 -1786.841 -1813.775 -1791.242 -1816.002 -1813.514 -1837.239
## 8 -1805.016 -1804.124 -1837.747 -1810.064 -1833.675 -1837.356 -1866.282
## 9 -1819.119 -1829.115 -1858.144 -1828.672 -1862.825 -1856.887 -1885.284
## 
## Top 3 models based on the BIC criterion: 
##     EII,4     EEI,4     EEE,4 
## -1723.512 -1728.587 -1730.933
gmm <- Mclust(d[,c(1,2)], x = BIC) # learn the EM GMM model
summary(gmm, parameters = TRUE)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EII (spherical, equal volume) model with 4 components: 
## 
##  log-likelihood   n df       BIC       ICL
##       -829.9659 200 12 -1723.512 -1752.111
## 
## Clustering table:
##  1  2  3  4 
## 46 53 55 46 
## 
## Mixing probabilities:
##         1         2         3         4 
## 0.2310949 0.2650084 0.2731766 0.2307202 
## 
## Means:
##        [,1]     [,2]     [,3]     [,4]
## x 0.7590998 1.079758 4.840720 4.946206
## y 1.4326625 5.122370 1.224376 5.184590
## 
## Variances:
## [,,1]
##         x       y
## x 1.09485 0.00000
## y 0.00000 1.09485
## [,,2]
##         x       y
## x 1.09485 0.00000
## y 0.00000 1.09485
## [,,3]
##         x       y
## x 1.09485 0.00000
## y 0.00000 1.09485
## [,,4]
##         x       y
## x 1.09485 0.00000
## y 0.00000 1.09485
plot(x=d[,1],y=d[,2],col=gmm$classification,pch=4,xlab="x",ylab="y")
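
mclust also provides plotting methods for the objects computed above; an optional sketch to inspect the BIC curves and the fitted classification visually:

plot(BIC)                          # BIC curves for all covariance models and numbers of clusters
plot(gmm, what = "classification") # classification with the fitted cluster ellipses overlaid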

Hierarchical clustering

Hierarchical clustering builds a hierarchy of clusters. We will compare three variants of agglomerative hierarchical clustering that differ in the linkage function defining the similarity between clusters. The algorithm starts by treating each object as a singleton cluster; pairs of clusters are then successively merged until all objects end up in one big cluster. The result is a tree-based representation of the objects known as a dendrogram. Note that the number of clusters is set manually here, which may favor this class of methods in the later quality assessment.

hc.complete<-hclust(dist(d[,c(1,2)]), method="complete")
hc.average<-hclust(dist(d[,c(1,2)]), method="average")
hc.single<-hclust(dist(d[,c(1,2)]), method="single")
par(mfrow=c(1,3))
plot(hc.complete,main="Complete Linkage", xlab="", sub="", cex=.9)
plot(hc.average, main="Average Linkage", xlab="", sub="", cex=.9)
plot(hc.single, main="Single Linkage", xlab="", sub="", cex=.9)

hc.complete.out<-cutree(hc.complete,4)
hc.average.out<-cutree(hc.average,4)
hc.single.out<-cutree(hc.single,4)
plot(x=d[,1],y=d[,2],col=hc.complete.out,pch=4,xlab="x",ylab="y")
plot(x=d[,1],y=d[,2],col=hc.average.out,pch=4,xlab="x",ylab="y")
plot(x=d[,1],y=d[,2],col=hc.single.out,pch=4,xlab="x",ylab="y")
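
As an optional extension beyond the three linkages compared above, Ward's linkage is another common choice that favors compact, roughly spherical clusters; a short sketch using the same workflow (hc.ward is our own name):

hc.ward<-hclust(dist(d[,c(1,2)]), method="ward.D2")
hc.ward.out<-cutree(hc.ward,4)
plot(x=d[,1],y=d[,2],col=hc.ward.out,pch=4,xlab="x",ylab="y")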

Density-based clustering

DBSCAN is a non-parametric density-based clustering algorithm. It groups together closely packed points (those with many nearby neighbors) and marks as outliers the points that lie alone in low-density regions (whose nearest neighbors are too far away). Two parameters drive the clustering process: eps (epsilon) and minPts. Epsilon defines the radius within which the algorithm searches for neighboring points around a given data point; larger values of epsilon produce larger neighborhoods, potentially causing more points to be considered neighbors and leading to larger clusters. MinPts is the minimum number of data points required to form a dense region, i.e., a cluster; a larger minPts value makes the algorithm more conservative in forming clusters.

kNNdistplot(d[,c(1,2)], k = 4) # find the best settings for Epsilon and minPts, look for the knee!
abline(h=.9, col = "red", lty=2) # the knee gives the optimal eps for the given minPts/k

dbscan<-dbscan(d[,c(1,2)],eps=0.9, minPts = 4) # the knee seems to be around 0.9; however, the clustering is not satisfactory, only 2 clusters emerge
dbscan<-dbscan(d[,c(1,2)],eps=0.65, minPts = 4) # use a smaller eps to have more clusters
dbscan$cluster # zero indicates noise points
##   [1] 1 1 1 1 0 1 1 1 1 3 1 1 2 1 1 1 1 1 0 1 3 1 1 0 3 1 1 3 1 0 1 1 2 1 1 3 1
##  [38] 0 1 1 3 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 1 0 0 2 2 0 2 2 2 2 2 2 2 2
## [112] 2 2 0 1 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 0 0
## [149] 2 2 4 4 2 2 4 4 4 4 4 4 4 4 0 1 0 4 0 4 4 4 4 4 0 0 4 4 4 1 4 4 0 4 0 4 1
## [186] 4 0 0 4 4 4 4 4 4 4 4 4 4 0 4
plot(x=d[,1],y=d[,2],col=dbscan$cluster+1,pch=4,xlab="x",ylab="y")

Clustering quality

The clustering quality can be evaluated directly because the gold-standard clusters (also referred to as classes) are known: they correspond to the original mixture components. We can employ the measures of purity (clusterPurity implemented below), normalized mutual information (NMI taken from the NMI library) or the adjusted Rand index (adjustedRandIndex taken from the mclust library). For all these measures, higher is better. As expected, the two methods whose assumptions match the generator, k-means and EM GMM, clearly come out on top (in this particular run k-means is marginally ahead), while single linkage and DBSCAN lag far behind.

# purity: each cluster is assigned its majority class; purity is the fraction
# of objects that fall into the majority class of their cluster
clusterPurity <- function(clusters, classes) {
  sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}

clustOut<-list(kmeans=km.out$cluster,gmm=gmm$classification,hcomp=hc.complete.out,havg=hc.average.out,hsingle=hc.single.out,dbscan=dbscan$cluster)

quality<-data.frame(row.names=c("k-means","EM GMM","Hcomplete","Haverage","Hsingle","DBSCAN"),
                    purity=sapply(clustOut,function(x) clusterPurity(x,d$class)),
                    NMI=sapply(clustOut,function(x) unlist(NMI(data.frame(seq(nrow(d)),x),data.frame(seq(nrow(d)),d$class)))),
                    adjustedRandIndex=sapply(clustOut,function(x) adjustedRandIndex(x,d$class)))

kable(quality)
             purity       NMI   adjustedRandIndex
k-means       0.940 0.8218236           0.8444515
EM GMM        0.935 0.8110647           0.8321106
Hcomplete     0.880 0.7227227           0.7055602
Haverage      0.915 0.7765625           0.7829971
Hsingle       0.270 0.0372515           0.0003118
DBSCAN        0.690 0.5268760           0.4628981

Challenges in clustering

Let us show the influence of some simple modifications of the introductory task. We will change the scale in one of the dimensions, add outliers, or change the cluster density and shapes, and see how these changes influence the clustering partitions and how well those partitions match the gold-standard classes.

# change the scale in the second dimension
d[,2]<-d[,2]/100 
# scale(d[,c(1,2)]) # standardizing the data would undo the relative distortion if needed

# add outliers, two objects very distant from the rest
d[1,c(1:2)]<-c(-100,-100) 
d[100,c(1:2)]<-c(100,100)

# change cluster density
# remove 80% of the objects from clusters 1 and 4
d<-d[c(41:160),]

# change cluster shapes
Sigma1 <- matrix(c(1,0.9,0.9,1),2,2)
Sigma2 <- matrix(c(2,-0.8,-0.8,1),2,2)
d<-mvrnorm(n = 50, mu=c(1,1),Sigma2)
d<-rbind(d,mvrnorm(n = 50, mu=c(1,5),Sigma1))
d<-rbind(d,mvrnorm(n = 50, mu=c(5,1),Sigma1))
d<-rbind(d,mvrnorm(n = 50, mu=c(5,5),Sigma2))
d<-cbind(d,rep(c(1,2,3,4),each=50))

# change cluster shapes as well as cluster centers
Sigma2 <- matrix(c(1,-0.98,-0.98,1),2,2)
d<-mvrnorm(n = 50, mu=c(1,1),Sigma2)
d<-rbind(d,mvrnorm(n = 50, mu=c(2,2),Sigma2))
d<-rbind(d,mvrnorm(n = 50, mu=c(3,3),Sigma2))
d<-rbind(d,mvrnorm(n = 50, mu=c(4,4),Sigma2))
d<-cbind(d,rep(c(1,2,3,4),each=50))

plot(d[,1],d[,2],col=d[,3])

d<-as.data.frame(d)
colnames(d)<-c("x","y","class")

Individual work:

Your task is to play with the data generator settings and demonstrate their impact on the individual algorithms. You can propose your own changes, for example, to push forward your favorite method. Remember that we worked with a Gaussian-mixture generator, which favors EM GMM. Finally, verbally summarize the advantages and disadvantages of the individual clustering algorithms with respect to the different settings.

Changing the scale

After rescaling one of the axes, the clustering outcome gets much worse. Only the first dimension still matters for object distances; the second one becomes negligible and its information is effectively lost. The only exception is EM GMM, which is invariant to this kind of rescaling: the method simply adapts its mean and variance parameters and the clustering quality remains the same. In general, we always have to use a sensible measure of distance between objects, which may involve rescaling the data first.
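
A common remedy, hinted at by the commented-out scale() call in the code above, is to standardize each dimension before clustering; a minimal sketch, assuming d holds the rescaled variant of the original mixture:

# standardize both dimensions to zero mean and unit variance before clustering
d.scaled <- scale(d[,c(1,2)])
km.scaled <- kmeans(d.scaled, centers=4, nstart=10)
table(d$class, km.scaled$cluster) # with both axes contributing again, the clusters
                                  # typically follow the classes closely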

Adding the outliers

With these two severe outliers, DBSCAN naturally recognizes them as noise and maintains its performance. For the remaining methods, it is important to assign the outliers to separate clusters, which above all means recognizing the right number of clusters. BIC works well for EM GMM: the right model, EII with 6 clusters, is found and EM GMM as a whole works very well too. For k-means, the elbow method fails to identify the outliers as new clusters, which may result in a performance decrease; if the number of clusters is set to 6 manually, the method works well too. In hierarchical clustering, the dendrograms point out the outliers clearly, although the outliers may make it harder to recognize the remaining clusters; when they are found properly and the number of clusters is set to 6, the original performance is maintained as well.
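
A hedged sketch of the manual fix mentioned above for k-means (6 centers so that each injected outlier can be isolated), assuming d holds the variant with the two outliers:

# allow two extra clusters so that each extreme outlier can occupy its own cluster
km6 <- kmeans(d[,c(1,2)], centers=6, nstart=10)
table(d$class, km6$cluster) # the two outliers typically end up in tiny separate clusters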

Cluster density changes

The identification of the right number of clusters becomes difficult. Both k-means with the gap statistic and EM GMM with BIC tend to propose 2 clusters, and DBSCAN may miss the less dense clusters and treat them as noise. This underestimate leads to a decrease in the clustering quality measures. If the number of clusters is found correctly (as with the manual setting in hierarchical clustering), the performance is more or less maintained.
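
For DBSCAN, one possible mitigation is to re-tune eps and minPts for the sparser clusters; a rough sketch (the concrete values are guesses and should be re-read from the kNN distance plot), assuming d holds the variant with the thinned clusters:

# sparser clusters need a looser density threshold to avoid being labelled as noise
kNNdistplot(d[,c(1,2)], k = 3)  # re-inspect the knee on the modified data
dbscan.sparse <- dbscan(d[,c(1,2)], eps=0.8, minPts=3)
table(dbscan.sparse$cluster)    # cluster 0 counts the points still treated as noise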

Changing cluster shapes

When changing the cluster shapes, EM GMM works very well. We still have to keep in mind that the clustering model matches the model used by the data generator. K-means often fails to find the right number of clusters, and the cluster shapes are far from isotropic (i.e., round) as well. Single-link clustering does not work at all, and DBSCAN suffers from the poor separation between clusters.

When changing the cluster shapes as well as their centers, EM GMM still works very well, k-means worsens further, single link surprisingly does not improve much, and DBSCAN catches up on EM GMM.
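
To see the effect of the shape change concretely, one can refit the mixture model on the elongated clusters and inspect which covariance structure BIC selects; a short sketch, assuming d holds one of the correlated variants generated above:

# refit on the correlated (elongated) clusters; BIC will typically now prefer an
# ellipsoidal covariance model instead of the spherical EII
gmm.shape <- Mclust(d[,c(1,2)])
summary(gmm.shape)
adjustedRandIndex(gmm.shape$classification, d$class)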