The K-means algorithm is the first *unsupervised* method we encounter in the course. The previous *supervised* methods worked with a set of observations and labels $\mathcal{T} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$. The *unsupervised* methods use unlabeled observations $\mathcal{T} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$. The goal is to infer properties of a probability distribution $p_X(\mathbf{x})$.

In low-dimensional problems ($dim(\mathbf{x}) \leq 3$), we can estimate the probability density directly using variety of techniques (e.g. histograms, Parzen window). And when we have a prior information about the type of a particular probability distribution, we can directly estimate the distribution parameters.

Due to curse of dimensionality these methods fail in higher dimensions. When the dimensionality increases, the volume grows exponentially and the data becomes exponentially sparser. In such cases we use techniques that do not find the full underlying distribution $p_X(\mathbf{x})$, but instead describe the data in a more relaxed way.

The main groups of *unsupervised* methods are:

- dimensionality reduction methods
- clustering methods

The dimensionality reduction methods map data onto a lower-dimensional manifold while minimizing the lost information. The clustering methods partition the observations into groups with small difference between the group samples.

The *unsupervised* methods have many applications including smart feature selection, data visualization, efficient data encoding, adaptive classification and image segmentation. Last but not least example is Google's PageRank algorithm.

The K-means algorithm is a popular clustering approach. The goal is to find a clustering $\{\mathcal{T}_k\}^K_{k=1}$ of the observations that minimizes the *within-cluster sum of squares (WCSS)*:
$$\{\mathbf{c}_1, \ldots, \mathbf{c}_K; \mathcal{T}_1, \ldots, \mathcal{T}_K\} = \underset{\text{all }\mathbf{c}^{'}_k, \mathcal{T}^{'}_k} {\operatorname{arg\,min}} \sum_{k=1}^{K} \sum_{\mathbf{x} \in \mathcal{T}^{'}_k} \| \mathbf{x} - \mathbf{c}^{'}_k \|^2_2,$$
where $K$ is a user specified number of clusters, $\mathcal{T}_1,\ldots,\mathcal{T}_K$ are sets of observations belonging to clusters $1,\ldots,K$ and each cluster is represented by its prototype $\mathbf{c}_k$.

**The K-means algorithm**

**Input**: set of observations $\mathcal{T} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, number of clusters $K$, maximum number of iterations.

**Output**: set of cluster prototypes $\{\mathbf{c}_1, \ldots, \mathbf{c}_K\}$.

**Initialize** $\{\mathbf{c}_1, \ldots, \mathbf{c}_K\}$ with random unique input observations $\mathbf{x}_i$.

**Do:**

*Classify*observations $\{\mathbf{x}_l\}^L_{l=1}$ to the cluster represented by the nearest prototype $\{\mathbf{c}_k\}^K_{k=1}$:

$$y_l = \arg\min_{k=1,\ldots,K}\|\mathbf{x}_l - \mathbf{c}_k\|^2_2 \qquad (\forall l = 1,\ldots,L),$$ $$ \mathcal{T}_k = \{\mathbf{x} \in \mathcal{T}: \forall l, y_l = k\} \qquad (\forall k = 1,\ldots,K) $$

*Update*the prototypes $\mathbf{c}_k$ to the mean value of all vectors $\mathbf{x}_l$ belonging to $k$-th class:

$$\mathbf{c}_k = \begin{cases}\frac{1}{|\mathcal{T}_k|}\sum_{\mathbf{x} \in \mathcal{T}_k} \mathbf{x} & \text{if}~|\mathcal{T}_k| > 0 \\ \text{re-initialize} & \text{otherwise} \end{cases}.$$

**Until** no change in clustering $\mathcal{T}_k$ or maximum number of iterations is exceeded.

The solution that K-means reaches often depends on the initialization of the means and there is no guarantee that it reaches the global optimum. As you can see in the two examples below, the solution is not unique and it can happen that the algorithm reaches a local minimum (as in the lower example), where reassigning any point to some other cluster would increase *WCSS*, but where a better solution does exist.

The chance of getting trapped in a local minimum can be reduced by performing more trials of the K-means clustering and choosing the result that minimizes *WCSS*.

K-Means++ is the K-means with clever initialization of cluster centers. The motivation is to make initializations which make K-means more likely to end up in a better local minima (a minima with a lower WCSS).

K-Means++ uses the following randomized sampling strategy for constructing the initial cluster centers set $\mathcal{C}$:

- Choose the first cluster center $\mathbf{c}_1$ uniformly at random from $\mathcal{T}$. Set $\mathcal{C} = \{\mathbf{c}_1\}$.
- For each data point $\mathbf{x}_l$, compute the distance $d_l$ to its nearest cluster in $\mathcal{C}$: $$ d_l = \min_{\mathbf{c}\in \mathcal{C}} \|\mathbf{x}_l - \mathbf{c} \| \qquad (\forall l=1,2,..., L) $$
- Select a point $\mathbf{x}_l$ from $\mathcal{T}$ with probability proportional to $d^2_l$. This involves constructing a distribution $p(l)$ from $d_l$ as $p(l) = \frac{d_l^2}{\sum_{l=1}^L d_l^2}$ and sampling from it to get the index $l$.
- $\mathcal{C} \leftarrow \mathcal{C} \cup \mathbf{x}_l$.
- Stop if $|\mathcal{C}| = K$, otherwise goto 2.

After this initialization, standard K-means algorithm is employed.

General information for Python development.

To fulfil this assignment, you need to submit these files (all packed in a single `.zip`

file) into the upload system:

- a script for data initialization, calling of the implemented functions and plotting of their results (for your convenience, will not be checked)`kmeans.ipynb`

- containing the following implemented methods:`kmeans.py`

- a function which implements the K-means algorithm`k_means`

- a function which performs several trials of the K-means clustering algorithm in order to avoid local minima`k_means_multiple_trials`

,`k_meanspp`

- functions for k-means++ initialization`random_sample`

- a function which implements color quantization based on K-means`quantize_colors`

,`random_initialization.png`

,`kmeanspp_initialization.png`

- images specified in the tasks`geeks_quantized.png`

** Use template of the assignment.** When preparing a zip file for the upload system, **do not include any directories**, the files have to be in the zip file root.

Your task is to implement the basic K-means algorithm as described in the box above. See the template for details.

# In order to obtain the same results, use this initialization of the random number generator # (don't put this into functions submitted to the upload system). np.random.seed(0) x = np.array([[16, 12, 50, 96, 34, 59, 22, 75, 26, 51]]) cluster_labels, centroids, sq_dists = k_means(x, 3, np.inf) print('cluster_labels: ', cluster_labels, '\ncentroids: ', centroids, '\nsq_dists: ', sq_dists) # -> cluster_labels: [1 1 0 0 2 0 1 0 1 0] # -> centroids: [[66 19 34]] # -> sq_dists: [ 9. 49. 256. 900. 0. 49. 9. 81. 49. 225.]

Limited number of iterations:

np.random.seed(0) images = np.load("data_kmeans.npz", allow_pickle=True)["images"] x = compute_measurements(images) _, centroids, _ = k_means(x, 3, 4) print('centroids:\n', centroids) # -> centroids: # -> [[ 330.84821429 -115.04545455 -4566.76 ] # -> [-2348.99107143 658.20454545 2375.78 ]]

Test the algorithm on the measurements extracted from the unlabelled data:

, which consists of images of three letters - H, L, T (here K = 3)`data_kmeans.npz`

- the MNIST dataset
of hand written numbers, K = 10`data_mnist_kmeans.npz`

Inspect the results.

Implement the function

which tries to avoid being trapped in local minimum by performing K-means several times and selecting the solution that minimizes the **k_means_multiple_trials(x, k, n_trials, max_iter)]]***WCSS*.

Implement clever K-means initialization: K-means++. The function

returns initial cluster prototypes for use in the already implemented K-means algorithm. You will first need to prepare **centroids = k_meanspp(x, k)**

function that picks randomly a sample based on the sample weights. See the following recipe and figure to illustrate the sampling process:
**idx = random_sample(weights)**

- normalize the sample weights to get a probability distribution over samples
- draw a random sample from the above probability distribution:
- generate a random number in the interval (0, 1)
- select corresponding sample

**Note**: use random_sample() with all weights set to 1 for choosing the first centroid.

Your implementation should return the same results:

np.random.seed(0) distances = np.array([10, 2, 4, 8, 3]) idx_1 = random_sample(distances) idx_2 = random_sample(distances) print('idx_1: ', idx_1, '\nidx_2: ', idx_2) # -> idx_1: 2 # -> idx_2: 3

K-means++ favors initialization with cluster prototypes far from each other:

np.random.seed(0) x = np.atleast_2d(np.array(range(100))) centroids = k_meanspp(x, 3) print('centroids: ', centroids) # centroids: [[54 82 15]]

Inspect the K-means++ initialization for the data generated by ** gen_kmeanspp_data** and compare it to a uniform random initialization of vanilla K-means. Save the figures of initializations to

`random_initialization.png`

`kmeanspp_initialization.png`

`random_sample`

Apply K-means algorithm to the color quantization problem. The aim is to reduce the number of distinct colors in an image such that the resulting image is visually close to the source image. The algorithm is as

- Reshape the input image to (N, 3) and convert it to
`np.float64`

- Randomly select 1000 pixels from the reshaped image (
`inds = np.random.randint(0, N-1, 1000)`

). - Perform K-means clustering of the selected pixels in the three-dimensional RGB space (using
function with`k_means`

, without`max_iter=float('inf')`

initialization).`k_meanspp`

- Assign all image pixels to the found means.
- convert the result to
`np.uint8`

**without**rounding it first.

You are supposed to implement a function

which performs the described color quantization. Run this function on **im_q = quantize_colors(im, k)**** geeks.png** image and save the resulting quantized version as

`geeks_quantized.png`

`k=8`

**Important (for automatic evaluation):** To generate indices of the 1000 random pixels, use this code: `inds = np.random.randint(0, (h_image * w_image)-1, 1000)`

.

1. What cluster shapes can be separated by the K-means algorithm the easiest?

- a) hypercube
- b) hyperellipsoid
- c) hypersphere
- d) hyperplane

c)

2. Which of the following datasets would be correctly clustered using K-means with parameter k=2?

a) No. With k = 2, the data are split by a separating hyperplane, but the clusters are not linearly separable here. b) No. With k = 2, the data are split by a separating hyperplane, but the clusters are not linearly separable here. c) Yes.

3. Suppose that your are using slightly different version of the K-means random initialization. The initial means are just randomly generated vectors. This works reasonably well, but you can end up with an empty cluster while iterating. Choose **two** best options how to deal with an empty cluster:

- a) split the smallest cluster into two
- b) split the biggest cluster into two
- c) reduce the number of clusters
- d) randomly initialize the new mean
- e) assign mean of the whole dataset as the new mean

a) No. What if the smallest cluster already contains only one data point? b) Yes. This works well in practice. c) No. We wouldn't get the K clusters we are searching for. d) Yes. This works well in practice. e) No. This would always pick the same new mean. What if we have two empty clusters?

4. The sum of squared distances of the points to their assigned cluster means:

- a) Always decreases with the number of iterations
- b) Always increases with the number of iterations
- c) Can be non-monotonic in certain situations

a) Yes. An important fact needed to prove the k-means algorithm convergence. b) No. We are trying to minimize this quantity. If it always increased, the algorithm would be pretty awful :). c) No. If the cost would not change, the algorithm terminates.

courses/be5b33rpz/labs/11_k-means/start.txt ยท Last modified: 2022/11/09 12:46 by neoramic