Clustering algorithms

Clustering is an unsupervised machine learning technique that groups a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). [Suggested introductory reading]. Here, I shall talk about three such methods, namely EM clustering, k-means, and hierarchical clustering.

EM clustering



We have \(N\) unlabelled data samples \(D = \{ x_1, x_2, \dots, x_N \}\) spanning \(c\) classes \(w_1, w_2, \dots, w_c\). In order to simplify our analysis, let us assume \(c = 3\) and \(N = 500\). We can write the prior and likelihood for each class as follows:

| Prior | Likelihood |
| --- | --- |
| \(P(w_1)\) | \(P(x/w_1) \sim \mathcal{N}(\mu_1, \Sigma_1)\) |
| \(P(w_2)\) | \(P(x/w_2) \sim \mathcal{N}(\mu_2, \Sigma_2)\) |
| \(P(w_3)\) | \(P(x/w_3) \sim \mathcal{N}(\mu_3, \Sigma_3)\) |
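
For concreteness, a synthetic dataset of this form can be generated as in the sketch below; the particular means, covariances, and priors are illustrative assumptions, not values from this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) parameters for the three classes
priors = np.array([0.5, 0.3, 0.2])                        # P(w_1), P(w_2), P(w_3)
means = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]])   # mu_i
covs = [np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)]      # Sigma_i

N = 500
# Pick a class for each sample according to the priors, then draw the
# sample from that class's Gaussian likelihood
labels = rng.choice(3, size=N, p=priors)
X = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in labels])
print(X.shape)  # (500, 2)
```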


Gaussian Mixture Model (GMM)

The problem can be formulated as a GMM

\[P(x/\theta)=P(w_1)P(x/w_1)+P(w_2)P(x/w_2)+P(w_3)P(x/w_3)\]

Given the data points \(\{ x_1, x_2, \dots, x_N \}\), our goal is to find the parameters \(\theta\) (consisting of \(\mu_i\), \(\Sigma_i\), and \(P(w_i)\)) using maximum likelihood estimation (MLE).

Expectation maximization to solve GMM parameters

We can find the parameters using an iterative scheme: initialize \((\mu_i, \Sigma_i)\) with some random values and then iterate over the expectation and maximization steps until convergence is achieved.

Expectation

Assuming the current estimates \((\mu_i, \Sigma_i)\) are available at some iteration, we can perform Bayesian classification on \(\{ x_1, x_2, \dots, x_N \}\) using the likelihoods as

\[x_k \in w_i \text{ iff } P(x_k/w_i)>P(x_k/w_j) \forall j \neq i\]

Maximization

Given the points \(x_k\) assigned to class \(w_i\), what are the \((\mu_i, \Sigma_i)\) that maximize the likelihood \(P(x_k/w_i) \sim \mathcal{N}(\mu_i, \Sigma_i)\) for those points?

\[\{ x_1, x_2, \dots, x_{N_i} \} \in w_i\]

Using MLE,

\[\mu_i = \frac{1}{N_i} \sum_k x_k \quad \forall x_k \in w_i\]

\[\Sigma_i = \frac{1}{N_i} \sum_k (x_k-\mu_i)(x_k-\mu_i)^T \quad \forall x_k \in w_i\]

Thus, we have the updated values for \((\mu_i, \Sigma_i)\) and can perform the expectation step again, iterating until convergence.
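
A minimal NumPy/SciPy sketch of this hard-assignment loop is given below; the function and variable names are my own, and the initialisation is an assumption.

```python
import numpy as np
from scipy.stats import multivariate_normal

def hard_em_gmm(X, c=3, n_iter=50, seed=0):
    """Hard-assignment EM for a GMM, following the steps described above."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, size=c, replace=False)]      # random initial means
    Sigma = [np.eye(d) for _ in range(c)]             # initial covariances
    prior = np.full(c, 1.0 / c)                       # initial P(w_i)

    for _ in range(n_iter):
        # Expectation: assign each x_k to the class with the largest likelihood
        lik = np.column_stack([
            multivariate_normal.pdf(X, mean=mu[i], cov=Sigma[i]) for i in range(c)
        ])
        assign = lik.argmax(axis=1)

        # Maximization: MLE of mu_i, Sigma_i and P(w_i) from the assigned points
        for i in range(c):
            Xi = X[assign == i]
            if len(Xi) == 0:
                continue
            mu[i] = Xi.mean(axis=0)
            Sigma[i] = np.cov(Xi.T, bias=True) + 1e-6 * np.eye(d)
            prior[i] = len(Xi) / N

    return mu, Sigma, prior, assign

# Example usage on the synthetic data X generated earlier:
# mu, Sigma, prior, assign = hard_em_gmm(X, c=3)
```

Note that the full EM algorithm uses soft responsibilities \(P(w_i/x_k)\) in place of hard assignments; the sketch above mirrors the hard-assignment steps described in this section.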

k-means clustering



Let us assume an identity covariance matrix, i.e. \(\Sigma_i = I\). We can then write \(P(x_k/w_i) = \frac{1}{(2\pi)^{d/2}} e^{ \frac{-1}{2} (x_k-\mu_i)^T (x_k-\mu_i)} = \frac{1}{(2\pi)^{d/2}} e^{ \frac{-1}{2} \| x_k-\mu_i \| ^2 }\), so maximizing the likelihood of \(x_k\) under class \(w_i\) is equivalent to minimizing the squared distance \(\| x_k-\mu_i \|^2\).

Algorithm

  • Decide the number of clusters, \(c\)
  • Assume initial means \(\mu_i\) for classes \(1:c\)
  • Expectation
    • \(x_k \in w_i\) if \(P(x_k/w_i)>P(x_k/w_j) \forall j \neq i\)
    • equivalently, if \(\| x_k-\mu_i \| ^2 \leq \| x_k-\mu_j \| ^2 \forall j \neq i, \text{ then } x_k \in w_i\)
  • Maximization: \(\mu_i = \frac{1}{N_i} \sum_k x_k \forall x_k \in w_i\)
  • Repeat the expectation and maximization steps until convergence, i.e. \(\| \mu_i^t-\mu_i^{t-1} \| ^2\) becomes negligibly small

The code below shows a minimal implementation.
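
This is a sketch following the steps above; the function and variable names are illustrative.

```python
import numpy as np

def kmeans(X, c=3, n_iter=100, seed=0):
    """Plain k-means following the expectation/maximization steps above."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, size=c, replace=False)]        # assume initial mu_i

    for _ in range(n_iter):
        # Expectation: assign each x_k to the nearest mean (squared distance)
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, c)
        assign = dists.argmin(axis=1)

        # Maximization: recompute each mean from its assigned points
        new_mu = np.array([
            X[assign == i].mean(axis=0) if np.any(assign == i) else mu[i]
            for i in range(c)
        ])

        # Stop when the means no longer move
        if np.allclose(new_mu, mu):
            break
        mu = new_mu

    return mu, assign

# Example usage on the synthetic data X generated earlier:
# mu, assign = kmeans(X, c=3)
```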

Fig.1: Final clusters obtained via k-means
Fig.2: Final clusters obtained via k-means

As shown in the two figures above, the k-means algorithm correctly clusters the data points into three clusters.

Hierarchical clustering



Hierarchical clustering is a clustering method that seeks to build a hierarchy of clusters based on some distance or similarity metric. It can be further classified as

  • Agglomerative clustering (“bottom-up” approach): each point starts as a separate cluster, and the closest clusters are merged successively until a single cluster remains.
  • Divisive clustering (“top-down” approach): starts with a single cluster containing all points and keeps dividing until each point is its own cluster.
  • Isodata clustering: combines features of both agglomerative and divisive clustering to achieve an optimal set of clusters.

Based on the distance metric used, agglomerative clustering can be further classified as

  • Single link or Nearest neighbour
    • Distance between two clusters \(c_1\) and \(c_2\), \(d = \min \| X-Y \| \text{ where } X \in c_1 \text{ and } Y \in c_2\)
  • Complete link or farthest neighbour
    • Distance between two clusters \(c_1\) and \(c_2\), \(d = \max \| X-Y \| \text{ where } X \in c_1 \text{ and } Y \in c_2\)

Example: Single Link

Suppose that we have the points \(x_1, x_2, \dots, x_7\) with the distances shown below.

x1 ------- x2 ------- x3 ------- x4 ------- x5 ------- x6 ------- x7
     1          1.1        1.2        1.3        1.4        1.5 

To form clusters, we treat the individual points as clusters and grow them. We find the distance between two clusters using \(d = \min \| X-Y \| \text{ where } X \in c_1 \text{ and } Y \in c_2\) and merge \(c_i\) with \(c_j\) if \(d(c_i,c_j) \leq d(c_i,c_k) \; \forall k \neq j\). Finally, we obtain the dendrogram shown below.

x1         x2         x3         x4         x5         x6         x7
|          |          |           |         |          |          |
|          |          |           |         |          |          |
------------ 1        |           |         |          |          |
     |                |           |         |          |          |
     |                |           |         |          |          |
     ------------------ 1.1       |         |          |          |
              |                   |         |          |          |
              |                   |         |          |          |
              --------------------- 1.2     |          |          |
                        |                   |          |          |
                        |                   |          |          |
                        --------------------- 1.3      |          |
                                   |                   |          |
                                   |                   |          |
                                   --------------------- 1.4      |
                                             |                    |
                                             |                    |
                                             ---------------------- 1.5

Example: Complete Link

We can use the same distances given in the single-link example. Here, we find the distance between two clusters using \(d = \max \| X-Y \| \text{ where } X \in c_1 \text{ and } Y \in c_2\) and merge \(c_i\) with \(c_j\) if \(d(c_i,c_j) \leq d(c_i,c_k) \; \forall k \neq j\). Finally, we obtain the dendrogram shown below.

x1         x2         x3         x4         x5         x6         x7
|          |          |           |         |          |          |
|          |          |           |         |          |          |
------------ 1        |           |         |          |          |
     |                ------------- 1.2     |          |          |
     |                      |               ------------ 1.4      |
     |                      |                     |               |    
     |                      |                     |               |  
     |                      |                     ----------------- 2.9  
     ------------------------ 3.3                         |
              |                                           |
              |                                           |
              --------------------------------------------- 7.5
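
Both dendrograms can be reproduced with SciPy's hierarchical clustering routines; a short sketch for the seven points above is given here (plain Euclidean distances, which match the merge heights shown in the dendrograms).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# The seven points x1..x7 on a line, with consecutive gaps 1, 1.1, ..., 1.5
x = np.cumsum([0, 1, 1.1, 1.2, 1.3, 1.4, 1.5]).reshape(-1, 1)

for method in ("single", "complete"):
    Z = linkage(x, method=method)      # pairwise Euclidean distances by default
    plt.figure()
    dendrogram(Z, labels=[f"x{i}" for i in range(1, 8)])
    plt.title(f"{method}-link clustering")
plt.show()
```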
                     


