Week-3
Contents: 1. Clustering  2. Optimization problem  3. Lloyd's Algorithm  4. Voronoi Regions  5. Other considerations  6. K-means++

1. Clustering

Cluster membership
Assuming $K$ clusters, the assignment vector is $z \in \{1, \ldots, K\}^n$:
$$z = \begin{bmatrix} z_1 \\ \vdots \\ z_n \end{bmatrix}$$

Total number of cluster assignments possible: $\underbrace{K \cdot K \cdots K}_{n \text{ points}} = K^n$

Indicator function
$$\mathbb{1}(\text{cond}) = \begin{cases} 1, & \text{cond is true} \\ 0, & \text{cond is false} \end{cases}$$

Cluster means
For cluster $k$: $\mu_k \in \mathbb{R}^d$
$$\mu_k = \frac{\sum_{i=1}^{n} \mathbb{1}(z_i = k)\, x_i}{\sum_{i=1}^{n} \mathbb{1}(z_i = k)}$$
Cluster mean is the mean of the points assigned to it.

2. Optimization problem

Distance of $x_i$ to the mean of the cluster to which it is assigned, $\mu_{z_i}$: $\|x_i - \mu_{z_i}\|$

Objective function
Data-point view: $$f(D, z) = \sum_{i=1}^{n} \|x_i - \mu_{z_i}\|^2$$
Cluster view: $$f(D, z) = \sum_{k=1}^{K} \sum_{i=1}^{n} \mathbb{1}(z_i = k)\, \|x_i - \mu_k\|^2$$

Solve
$$\min_{z \in \{1, \ldots, K\}^n} f(D, z)$$
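As a concrete illustration, here is a minimal NumPy sketch (the function and variable names are my own, and clusters are indexed 0 to K-1 for convenience) of the cluster means and the objective $f(D, z)$ for a given assignment:

```python
import numpy as np

def cluster_means(X, z, K):
    """Mean of the points assigned to each cluster (assumes every cluster is non-empty)."""
    # X: (n, d) data matrix; z: (n,) assignment vector with values in {0, ..., K-1}
    return np.stack([X[z == k].mean(axis=0) for k in range(K)])

def objective(X, z, mu):
    """Sum of squared distances of each point to the mean of its assigned cluster."""
    return np.sum(np.linalg.norm(X - mu[z], axis=1) ** 2)

# Tiny usage example: n = 4 points in R^2, K = 2
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
z = np.array([0, 0, 1, 1])
mu = cluster_means(X, z, K=2)
print(mu)                   # cluster means: (0, 0.5) and (5.5, 5)
print(objective(X, z, mu))  # 1.0
```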
The search space is combinatorial, and hence this problem can be very hard to optimize exactly.

3. Lloyd's Algorithm

This is an iterative algorithm that tries to arrive at a reasonably good solution. It is also called the k-means algorithm. The algorithm always converges, and the convergence criterion is given below.

Convergence criterion
$$z^{(t+1)} = z^{(t)}$$

Cluster assignment at the end of iteration $t$: $z^{(t)} \in \{1, \ldots, K\}^n$
Mean of cluster $k$ at the end of iteration $t$: $\mu_k^{(t)} \in \mathbb{R}^d$

Initialization
$$z^{(0)} \leftarrow \text{random vector}, \qquad \mu_k^{(0)} = \frac{\sum_{i=1}^{n} \mathbb{1}(z_i^{(0)} = k)\, x_i}{\sum_{i=1}^{n} \mathbb{1}(z_i^{(0)} = k)}$$
Until convergence, update:

Cluster membership
$$z_i^{(t+1)} = \underset{k \in \{1, \ldots, K\}}{\arg\min}\; \|x_i - \mu_k^{(t)}\|^2$$
In the case of a tie between multiple clusters for $x_i$, do not shift allegiance; that is, retain the cluster assignment at time-step $t$ by setting $z_i^{(t+1)} := z_i^{(t)}$.

Cluster means
$$\mu_k^{(t+1)} = \frac{\sum_{i=1}^{n} \mathbb{1}(z_i^{(t+1)} = k)\, x_i}{\sum_{i=1}^{n} \mathbb{1}(z_i^{(t+1)} = k)}$$
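A compact NumPy sketch of these two updates (the function and variable names are my own; empty-cluster handling is left out as a simplifying assumption):

```python
import numpy as np

def lloyds_algorithm(X, K, rng=np.random.default_rng(0)):
    """Lloyd's (k-means) algorithm: iterate the two updates until the assignments stop changing."""
    n, d = X.shape
    z = rng.integers(0, K, size=n)          # random initial assignment z^(0)
    while True:
        # Cluster means: mean of the points currently assigned to each cluster
        # (assumption: every cluster stays non-empty)
        mu = np.stack([X[z == k].mean(axis=0) for k in range(K)])
        # Cluster membership: assign each point to its nearest mean
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (n, K)
        z_new = dists.argmin(axis=1)
        # Tie-breaking as in the notes: keep the old assignment when the distances are (numerically) equal
        ties = np.isclose(dists[np.arange(n), z_new], dists[np.arange(n), z])
        z_new[ties] = z[ties]
        if np.array_equal(z_new, z):        # convergence criterion: z^(t+1) = z^(t)
            return z, mu
        z = z_new
```

In practice one would run this several times from different random initializations and keep the run with the smallest objective value, as noted in section 5 below.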
If it takes $T$ time steps for the algorithm to converge, we can make the following observations:
- $z^{(0)}, \ldots, z^{(T)}$ is the sequence of assignments
- $z^{(T)} = z^{(T-1)}$ at convergence
- All cluster assignments from $z^{(0)}$ to $z^{(T-1)}$ are distinct; that is, cluster assignments never repeat.
- $f(D, z^{(T)}) = f(D, z^{(T-1)}) < \cdots < f(D, z^{(t)}) < \cdots < f(D, z^{(0)})$

4. Voronoi Regions

Given two clusters indexed by $r$ and $s$, the set of all points closer to cluster $s$ than to cluster $r$ is given by the following half-space:
$$\{x \in \mathbb{R}^d : \|x - \mu_s\|^2 \leq \|x - \mu_r\|^2\}$$
The separating hyperplane passes through the mid-point of the line segment joining the means, $\frac{\mu_s + \mu_r}{2}$, and is perpendicular to $\mu_r - \mu_s$ [in $\mathbb{R}^2$ this is the perpendicular bisector, a line].
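A small numeric check of this claim (the example values are my own): expanding both squared norms shows that the condition is linear in $x$, namely $2(\mu_r - \mu_s)^\top x \leq \|\mu_r\|^2 - \|\mu_s\|^2$, and the midpoint of the two means lies exactly on the boundary.

```python
import numpy as np

mu_s = np.array([0.0, 0.0])
mu_r = np.array([4.0, 2.0])
rng = np.random.default_rng(0)

# Nearest-mean comparison agrees with the linear half-space test for random points
for x in rng.normal(size=(5, 2)):
    closer_to_s = np.linalg.norm(x - mu_s) ** 2 <= np.linalg.norm(x - mu_r) ** 2
    linear_test = 2 * (mu_r - mu_s) @ x <= np.dot(mu_r, mu_r) - np.dot(mu_s, mu_s)
    assert closer_to_s == linear_test

# The midpoint of the means is equidistant from both, i.e. it lies on the boundary
mid = (mu_s + mu_r) / 2
print(np.isclose(np.linalg.norm(mid - mu_s), np.linalg.norm(mid - mu_r)))  # True
```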
Each cell is formed by the intersection of $K-1$ such half-spaces, obtained by comparing cluster $s$ with the remaining $K-1$ clusters. The cell formed in this manner is convex since:
- half-spaces are convex sets
- the intersection of convex sets is convex

5. Other considerations

- Lloyd's algorithm is deterministic. Given an initial cluster assignment, it will always return the same final clusters.
- $k$ is a hyperparameter and must be chosen appropriately. A hyperparameter is different from a parameter in that it is not "learnt from data" but is chosen before the learning begins. One heuristic to choose a value of $k$ is the elbow method.
- Initialization plays a key role in obtaining good clusters. Pathological initialization could lead to very poor clusters. In practice, for a given $k$, multiple runs of K-means with different initializations are performed, and the run which yields the smallest objective function value is chosen.

6. K-means++

K-means++ is an algorithm that provides a sounder initialization, which results in some convergence guarantees. The algorithm is probabilistic in nature. We will describe the algorithm for $K = 3$.

Step-1: Choose any of the $n$ data-points as the first mean $\mu_1$ by sampling uniformly at random. For convenience, we will relabel the sampled point as $x_1$. That is, if $x_5$ is sampled, swap the labels for $x_1$ and $x_5$.
$$P(\mu_1 = x_1) = \frac{1}{n}$$
Step-2: Choose the second mean as one of the remaining data-points using this process.
$d(x_i, \mu_1)$ = distance of $x_i$ from mean $\mu_1$
The score for each data-point is the squared distance: $s(x_i) = d(x_i, \mu_1)^2$
This defines a probability distribution over the $n-1$ remaining points. Recall that the summation starts from 2 since $x_1$ has been chosen:
$$P(x_i) = \frac{s(x_i)}{\sum_{j=2}^{n} s(x_j)}$$
Sample a data-point using this distribution and assign it to $\mu_2$. For convenience, relabel the sampled point as $x_2$. That is, if $x_5$ is sampled, swap the labels and the scores for $x_2$ and $x_5$.
$$P(\mu_2 = x_2 \mid \mu_1 = x_1) = \frac{s(x_2)}{\sum_{j=2}^{n} s(x_j)}$$
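A short sketch of how this step-2 distribution might be computed (array names and toy data are my own). Instead of excluding $x_1$ from the summation, the code scores every point: the point already chosen as $\mu_1$ has distance zero and therefore probability zero, which is equivalent.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))      # toy data: n = 10 points in R^2

mu1 = X[rng.integers(len(X))]     # Step-1: first mean, sampled uniformly at random

# Step-2: the score of each point is its squared distance to mu1;
# the point chosen as mu1 scores 0, so it can never be sampled again.
scores = np.sum((X - mu1) ** 2, axis=1)
probs = scores / scores.sum()
mu2 = X[rng.choice(len(X), p=probs)]
```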
Step-3: Choose the third mean as one of the remaining data-points using this process.
$d(x_i, \mu_j)$ = distance of $x_i$ from mean $\mu_j$
The score for each data-point is the squared distance of $x_i$ to the closest mean:
$$s(x_i) = \min\big(d(x_i, \mu_1),\, d(x_i, \mu_2)\big)^2$$
This defines a probability distribution over the $n-2$ remaining points. Recall that the summation starts from 3 since $x_1$ and $x_2$ have been chosen:
$$P(x_i) = \frac{s(x_i)}{\sum_{j=3}^{n} s(x_j)}$$
Sample a data-point using this distribution and assign it to $\mu_3$. For convenience, relabel the sampled point as $x_3$. That is, if $x_5$ is sampled, swap the labels and the scores for $x_3$ and $x_5$.
$$P(\mu_3 = x_3 \mid \mu_1 = x_1, \mu_2 = x_2) = \frac{s(x_3)}{\sum_{j=3}^{n} s(x_j)}$$
The overall probability of choosing these three means in this order is the product of the three probabilities we have computed so far. For $K > 3$, the algorithm can be extended in a similar fashion.
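A minimal sketch of the general procedure for arbitrary $K$ (the function name and interface are my own; this only produces the initial means, which would then be handed to Lloyd's algorithm):

```python
import numpy as np

def kmeans_plusplus_init(X, K, rng=np.random.default_rng(0)):
    """K-means++ initialization: each new mean is a data-point sampled with
    probability proportional to its squared distance to the nearest mean chosen so far."""
    n, d = X.shape
    means = [X[rng.integers(n)]]                      # Step-1: uniform choice of the first mean
    for _ in range(1, K):
        # Score: squared distance of each point to its closest already-chosen mean
        dists = np.array([[np.sum((x - mu) ** 2) for mu in means] for x in X])
        scores = dists.min(axis=1)
        probs = scores / scores.sum()                 # already-chosen points get probability 0
        means.append(X[rng.choice(n, p=probs)])
    return np.stack(means)

# Usage sketch: initialize with K-means++, then run K-means from these means
X = np.random.default_rng(1).normal(size=(100, 2))
init_means = kmeans_plusplus_init(X, K=3)
print(init_means.shape)   # (3, 2)
```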