Contents:
1. Clustering
2. Cluster Membership and Cluster Centers
3. Optimization problem
4. Lloyd's Algorithm (K-means)
5. Convergence
6. (*) Proof of Convergence
7. Nature of Clusters
8. Other considerations
9. Towards Better Initialization: K-means++

The starred (*) section may be mathematically heavy. Handle with care.

1. Clustering

Continuing with our study of unsupervised learning, we turn to another approach to understanding the data: clustering. The basic idea in clustering is to find groups of data-points that are tightly knit together and well separated from other such groupings. We call each group a cluster.

A common use case of clustering is market segmentation; for example, it can be used to identify different "clusters" of customers based on their behavior on an e-commerce platform. Once we have these clusters, companies can perform targeted interventions for different groups.

An example of a dataset from the wild that shows clusterable patterns is the Old Faithful dataset (Bishop, Pg 681). From the scatter plot, it is clear that there are two distinct groups/clusters.

Among the several clustering algorithms in the literature, we will take up K-means clustering, a simple yet effective technique for identifying clusters. As in the case of linear PCA, K-means is not a panacea for all our clustering problems. Its effectiveness is closely tied to the existence of specific patterns in the dataset.

2. Cluster Membership and Cluster Centers

The basic assumption in K-means is the presence of K points or centers that act as representatives for the K clusters in the dataset. For example, in the dataset given below, K = 3 and the three red points denote the cluster centers.

[Figure: a dataset with three clusters, labelled (1), (2) and (3); the point x_i lies in cluster (2).]

Each point is assigned to one of the three clusters. The assignment is indicated with the help of a cluster-membership variable, z_i. The data-point x_i is assigned to cluster z_i, where z_i ∈ {1, ⋯, K}. In the above image, z_i = 2 for the data-point x_i. In general, we can form a cluster-membership vector of size n:

z = (z_1, ⋯, z_n)ᵀ, where z ∈ {1, ⋯, K}^n

The total number of distinct cluster assignments possible is K^n. To see why, note that each point can be assigned to one of the K clusters, and by the multiplication rule the result follows:

K × ⋯ × K (n points) = K^n

It is often convenient to use the indicator function in our computations. Recall that the indicator function I(⋅) is given by:

I(condition) = 1 if the condition is true, and 0 if it is false.
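As a quick illustration of the indicator at work: in NumPy, the boolean mask `z == k` plays the role of I(z_i = k), and averaging the selected rows gives the mean of cluster k (the mean formula is stated just below; the toy data here is made up):

```python
import numpy as np

# Toy dataset: five points in R^2 with a cluster assignment z in {1, 2}
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [10.0, 10.0], [12.0, 10.0]])
z = np.array([1, 1, 1, 2, 2])

def cluster_mean(X, z, k):
    """mu_k = sum_i I(z_i = k) x_i / sum_i I(z_i = k)."""
    mask = (z == k)                      # boolean mask: the indicator I(z_i = k)
    return X[mask].sum(axis=0) / mask.sum()

mu1 = cluster_mean(X, z, 1)              # mean of the three points in cluster 1
mu2 = cluster_mean(X, z, 2)              # mean of the two points in cluster 2
```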
The mean of cluster-k can be computed as the average of all the points that are assigned to it. Denoted 𝜇_k ∈ R^d, we have:

𝜇_k = ( ∑_{i=1}^{n} I(z_i = k) · x_i ) / ( ∑_{i=1}^{n} I(z_i = k) )

3. Optimization problem

So far we have assumed that data-points are magically grouped into their respective clusters. But how do we know if a particular grouping is good? For instance, consider two groupings of the same dataset. Which one is better?

[Figure: two candidate groupings of the same dataset.]

Visually, the one on the left clearly feels better. How do we quantify this? We would like points within a group to be tightly knit. That is, the distance of the points in a group to their mean must be as small as possible. We will pursue this idea now.

The distance of x_i to the mean of the cluster z_i to which it is assigned is given by:

||x_i − 𝜇_{z_i}||

[Figure: a point x_i, its cluster mean 𝜇_{z_i}, and the distance ||x_i − 𝜇_{z_i}|| between them.]

Extending this to all n points, and replacing the distance with the squared distance for mathematical convenience, we have:

f(z) = ∑_{i=1}^{n} ||x_i − 𝜇_{z_i}||²

We can equivalently express this function as:

f(z) = ∑_{k=1}^{K} ∑_{i=1}^{n} I(z_i = k) · ||x_i − 𝜇_k||²

The function f captures intra-cluster distances, that is, the closeness of data-points to their cluster centers. It says nothing about the separation between two clusters or the distance of a point from a neighboring cluster. One can also view f(z) as the sum of the (scaled) variances (spread) of points around their cluster centers. However, the variance plays a different role here when compared to PCA. In PCA, we interpreted variance as information content and tried to maximize it along a direction. Here, we treat variance as a measure of dispersal and try to find a configuration that minimizes it.

The task before us is to choose a cluster configuration that minimizes f:
min_{z ∈ {1,⋯,K}^n} f(z)

The variable being optimized is the n-dimensional vector z. However, each z_i is an integer from 1 to K, so the search space is combinatorial and this problem can be very hard to solve exactly. If we were to use a brute-force algorithm that evaluates every possible cluster configuration, we would have to search through K^n configurations, a number that blows up exponentially as n increases.

4. Lloyd's Algorithm (K-means)

Lloyd's algorithm, or the K-means algorithm, is an iterative approach that gives a reasonably good approximate solution to this problem. The algorithm starts with a random cluster assignment for the n points. We denote the random initialization by z^(0) ∈ {1,⋯,K}^n; the superscript is the iteration counter, which starts at 0.

z^(0) = (z_1^(0), ⋯, z_n^(0))ᵀ
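This random initialization is a single draw from {1,⋯,K}^n. A minimal sketch, assuming NumPy and an arbitrary seed:

```python
import numpy as np

n, K = 10, 3
rng = np.random.default_rng(0)           # fixed seed, purely for reproducibility
z0 = rng.integers(1, K + 1, size=n)      # each z_i^(0) drawn uniformly from {1, ..., K}
```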
Then, we alternate between two steps repeatedly. The first step computes the mean of each cluster:

𝜇_k^(t) = ( ∑_{i=1}^{n} I(z_i^(t) = k) · x_i ) / ( ∑_{i=1}^{n} I(z_i^(t) = k) )

Here, 𝜇_k^(t) is the mean computed in the t-th iteration. Once we have the means, we update the cluster assignments:

z_i^(t+1) = argmin_{k ∈ {1,⋯,K}} ||x_i − 𝜇_k^(t)||²

Fortunately, we don't have to repeat this process indefinitely. This process converges after a finite number of iterations. When the cluster assignments do not change from one iteration to the next, we claim that the algorithm has converged. Formally, the algorithm converges in T iterations if and only if:

z^(T) = z^(T−1)

The entire algorithm is given below:
K-Means(X, K)
1   z^(0) ← initialize(X, K)
2   do
3       𝜇_1^(t), ⋯, 𝜇_K^(t) ← updateMeans(X, z^(t))
4       z^(t+1) ← reassignClusters(X, z^(t), 𝜇^(t))
5   while z^(t+1) ≠ z^(t)
6   return 𝜇_1^(t), ⋯, 𝜇_K^(t)
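The pseudocode above can be sketched in Python as follows. The function names mirror the listing, but the NumPy details, the empty-cluster handling, and the optional `z0` argument (handy for deterministic runs) are my own filling-in, not the only way to do it:

```python
import numpy as np

def initialize(X, K, rng):
    """Assign each point a random cluster label in {1, ..., K}."""
    return rng.integers(1, K + 1, size=len(X))

def updateMeans(X, z, K):
    """Mean of the points assigned to each cluster; an empty cluster
    gets an infinite mean, i.e. it is effectively dropped."""
    means = np.full((K, X.shape[1]), np.inf)
    for k in range(1, K + 1):
        mask = (z == k)                  # the indicator I(z_i = k)
        if mask.any():
            means[k - 1] = X[mask].mean(axis=0)
    return means

def reassignClusters(X, means):
    """Assign each point to its nearest center in squared distance.
    Note: argmin breaks ties toward the smaller label; the refinement
    of retaining the old label on ties is omitted here for brevity."""
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # shape (n, K)
    return d2.argmin(axis=1) + 1

def kMeans(X, K, z0=None, seed=0):
    rng = np.random.default_rng(seed)
    z = initialize(X, K, rng) if z0 is None else np.asarray(z0)
    while True:                          # Python's idiom for a do-while loop
        means = updateMeans(X, z, K)
        z_new = reassignClusters(X, means)
        if np.array_equal(z_new, z):     # convergence: z^(t+1) = z^(t)
            return means, z
        z = z_new

# Two well-separated blobs; an explicit z0 makes the run deterministic.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
means, z = kMeans(X, 2, z0=[1, 2, 1, 2, 1, 2])
```

On this toy input the algorithm separates the two blobs within a couple of iterations.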
The reader may be unfamiliar with the do-while loop construct. A do-while loop is slightly different from a while loop in that the condition is checked at the end of each iteration, after executing the body, rather than at the beginning. In this way, the body of the loop runs at least once. Moving on, here is a quick summary of the two main objects that get updated in every iteration of the loop:

• 𝜇^(t): the means of the clusters corresponding to the assignment z^(t); they get updated at the beginning of iteration t.
• z^(t+1): the cluster assignment corresponding to the centers 𝜇^(t); it gets updated at the end of iteration t.
Remark: While computing z_i^(t+1), in the case of a tie between multiple clusters for x_i where z_i^(t) is one of the candidates in the tie, set z_i^(t+1) := z_i^(t). That is, do not shift allegiance. If it ain't broke, don't try to fix it. More on this in the section titled "Other considerations".
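The remark's tie-breaking rule can be sketched for a single point as below, assuming NumPy. Treating near-equal distances as ties via `np.isclose` is an implementation choice, not part of the original rule:

```python
import numpy as np

def reassign_with_ties(x, means, z_old):
    """New label for point x given centers `means` (K x d) and its old 1-based label."""
    d2 = ((means - x) ** 2).sum(axis=1)          # squared distance to each center
    tied = np.flatnonzero(np.isclose(d2, d2.min()))  # 0-based indices of all closest clusters
    if (z_old - 1) in tied:                      # old cluster among the winners: keep it
        return z_old
    return int(tied[0]) + 1                      # otherwise pick the smallest label

# A point equidistant from centers 1 and 3:
means = np.array([[0.0, 1.0], [5.0, 5.0], [0.0, -1.0]])
x = np.array([0.0, 0.0])
```

Here `reassign_with_ties(x, means, 3)` keeps label 3, while a point previously in cluster 2 moves to the smaller of the tied labels, cluster 1.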
If the algorithm converges in T iterations, here is a quick look at how the superscripts change from one iteration to the next:

z^(0) → 𝜇^(0) → z^(1) → 𝜇^(1) → z^(2) → ⋯ → 𝜇^(T−2) → z^(T−1) → 𝜇^(T−1) → z^(T)

We make the following observations:

• T ⩾ 1. That is, the algorithm has to run for at least one iteration, no matter what the input.
• The means/centers are updated T times, going from 𝜇^(0) to 𝜇^(T−1). The means returned are 𝜇^(T−1).
• The points are reassigned T times, going from z^(1) to z^(T). The final assignment is z^(T). Note that z^(T) is equal to z^(T−1), which is the reason the algorithm terminated at this step. Also note that z^(0) doesn't count as a reassignment; it is the initial assignment.

5. Convergence

Lloyd's algorithm converges. There are two reasons that work in tandem to make this possible:

• The value of the objective function strictly decreases from one iteration to the next as long as z^(t) ≠ z^(t+1). That is, if z^(t) ≠ z^(t+1), then f(z^(t)) > f(z^(t+1)).
• The number of cluster configurations is finite, namely K^n.

If we put the two facts together, we conclude that no cluster configuration can repeat: a repeat would force the strictly decreasing objective to return to an earlier value. Since there are only finitely many configurations, we can't keep seeing new configurations indefinitely. So at some point two consecutive configurations have to be identical. Once we have z^(t) = z^(t+1), there is no point in continuing the algorithm and we declare convergence. The next section provides a rigorous argument for why this is true.

6. (*) Proof of Convergence

We return to the objective function. Earlier, we had represented it as f(z). We will now change it slightly to include the means:

f(z, 𝜇) = ∑_{i=1}^{n} ||x_i − 𝜇_{z_i}||²

We need to track the value of f after each iteration. More precisely, we track the following update:

f(z^(t), 𝜇^(t)) → f(z^(t+1), 𝜇^(t+1))

We can break this down further by looking at the following sequence:

f(z^(t), 𝜇^(t)) → f(z^(t+1), 𝜇^(t)) → f(z^(t+1), 𝜇^(t+1))

We now have the key result below:
Theorem: If z^(t+1) ≠ z^(t), then the following inequality holds:

f(z^(t), 𝜇^(t)) > f(z^(t+1), 𝜇^(t)) ⩾ f(z^(t+1), 𝜇^(t+1))

In particular, if z^(t+1) ≠ z^(t), then f(z^(t), 𝜇^(t)) > f(z^(t+1), 𝜇^(t+1)). In simple terms, the objective function's value strictly decreases from one iteration to the next.
To prove this, first consider the pair f(z^(t), 𝜇^(t)) → f(z^(t+1), 𝜇^(t)). This transition corresponds to the reassignment step. Since z^(t+1) ≠ z^(t), we have at least one point x_i for which z_i^(t+1) ≠ z_i^(t). That is, this point got reassigned because it is strictly closer to the center of cluster z_i^(t+1) than to that of cluster z_i^(t) (recall the tie-breaking rule: on a tie, the old assignment is retained), which in turn means:

||x_i − 𝜇^(t)_{z_i^(t+1)}||² < ||x_i − 𝜇^(t)_{z_i^(t)}||²

Since no point's squared distance increases under reassignment and at least one strictly decreases, it follows that f(z^(t), 𝜇^(t)) > f(z^(t+1), 𝜇^(t)).

Now for the second transition, f(z^(t+1), 𝜇^(t)) → f(z^(t+1), 𝜇^(t+1)). This corresponds to the update of the means. In this step the assignment stays fixed; only the means change. That is, the points assigned to cluster k remain the same, only the position of the mean changes.

[Figure: the old mean 𝜇_k^(t) and the new mean 𝜇_k^(t+1) of the points assigned to cluster k.]

𝜇_k^(t+1) is the mean of the data-points assigned to cluster k. The claim is that these points are, in total, closer to 𝜇_k^(t+1), their mean, than to 𝜇_k^(t), which may or may not be their mean now that a reassignment has happened. The following lemma shows this.
Lemma: If {x_1, ⋯, x_n} is a set of points, then the point that is closest to the dataset in a "sum of squared distances" sense is the mean of the data-points. In other words:

p* = argmin_p ∑_{i=1}^{n} ||x_i − p||²  ⟹  p* = (1/n) ∑_{i=1}^{n} x_i
Proof: Let the objective be g(p):

g(p) = ∑_{i=1}^{n} ||x_i − p||²
     = ∑_{i=1}^{n} ( ||x_i||² + ||p||² − 2 x_iᵀp )

We can differentiate the objective:

∇g(p) = ∑_{i=1}^{n} ( 2p − 2x_i )
      = 2 ( np − ∑_{i=1}^{n} x_i )

Setting it to zero, we get:

p* = (1/n) ∑_{i=1}^{n} x_i
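The lemma is easy to sanity-check numerically: for random points, the mean beats any perturbed candidate in the sum-of-squared-distances sense. A small sketch (the data and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 3))                  # 20 random points in R^3

def g(p):
    """g(p) = sum_i ||x_i - p||^2."""
    return ((X - p) ** 2).sum()

p_star = X.mean(axis=0)                       # the claimed minimizer: the mean

# g at the mean never exceeds g at a randomly perturbed candidate
worse = min(g(p_star + rng.normal(scale=0.1, size=3)) for _ in range(100))
```

In fact g(p* + d) = g(p*) + n·||d||², so any nonzero displacement d strictly increases the objective.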
So we see that f(z^(t+1), 𝜇^(t)) ⩾ f(z^(t+1), 𝜇^(t+1)). The original result now follows:

f(z^(t), 𝜇^(t)) > f(z^(t+1), 𝜇^(t)) ⩾ f(z^(t+1), 𝜇^(t+1))

7. Nature of Clusters

Let us start with two clusters and identify the regions into which the plane is divided by the two cluster centers. The line perpendicular to the segment joining the two means and passing through its mid-point is the perpendicular bisector, and it is the boundary between the two regions.

[Figure: means 𝜇_1 and 𝜇_2, the boundary wᵀx + b = 0, and the two regions S_1 = {x : wᵀx + b ⩾ 0} and S_2 = {x : wᵀx + b < 0}.]

During inference, any point falling in region 1 would belong to cluster-1 and any point falling in region 2 would belong to cluster-2. These two regions are called half-spaces. The equation of the boundary separating the two cluster regions is given by the hyperplane wᵀx + b = 0. Why is this true? Any point x on the boundary is equidistant from both cluster centers:

||x − 𝜇_1||² = ||x − 𝜇_2||²
⟹ ||x||² + ||𝜇_1||² − 2 xᵀ𝜇_1 = ||x||² + ||𝜇_2||² − 2 xᵀ𝜇_2
⟹ (𝜇_2 − 𝜇_1)ᵀx + (1/2)( ||𝜇_1||² − ||𝜇_2||² ) = 0
Setting w = 𝜇_2 − 𝜇_1 and b = (1/2)( ||𝜇_1||² − ||𝜇_2||² ), we get the desired form. Note that this separating hyperplane passes through the mid-point of the segment joining the means, (𝜇_1 + 𝜇_2)/2, and is perpendicular to 𝜇_2 − 𝜇_1 [this is the perpendicular bisector]. Extending this to three clusters, we draw the pairwise perpendicular bisectors.

[Figure: three means 𝜇_1, 𝜇_2, 𝜇_3 and the three pairwise boundaries.]

To get the region corresponding to each cluster, we intersect the relevant half-spaces. The resulting regions, after removing the dotted lines, look like this:

[Figure: the three resulting regions, one per mean.]

These are called Voronoi regions. Each "cell" that is formed is a convex set. In general, if there are K clusters:

• Each cell is formed by the intersection of K−1 half-spaces, obtained by comparing one cluster with the remaining K−1 clusters.
• The cell that is formed in this manner is convex, since:
  – half-spaces are convex sets, and
  – the intersection of convex sets is convex.

8. Other considerations

• Lloyd's algorithm is deterministic. Given an initial cluster assignment, it will always return the same final clusters.
• The number of clusters, K, is a hyperparameter and must be chosen appropriately. A hyperparameter is different from a parameter in that it is not "learnt from data" but is chosen before the learning begins. One heuristic for choosing a value of K is the elbow method.
• Initialization plays a key role in obtaining good clusters. A pathological initialization could lead to very poor clusters. In practice, for a given K, multiple runs of K-means with different initializations are performed, and the run which yields the smallest objective function value is chosen. We discuss two possible initializations:
  – Initialize z_i^(0) randomly for each point. That is, for each point, pick a value from 1 to K at random.
  – Randomly choose K data-points from the dataset as the K initial means. Here, we bypass the z^(0) step and directly assign 𝜇_j^(0).
  Both are valid initializations. In our course, we will follow the first method for vanilla K-means and a variant of the second method when we discuss K-means++ in the next section.
• There are a couple of edge cases:
  – The first one has to do with the update step. A cluster could become empty at some stage in the algorithm.
For example, consider eight points with K = 5 and the following initial configuration:

[Figure: eight points; four are initially assigned to cluster (1), and one each to clusters (2), (3), (4) and (5).]

  – After the first iteration, one of the clusters has the origin as its mean. In the next iteration, none of the points will be assigned to this cluster, so it becomes empty. Once a cluster becomes empty, we can drop it from the rest of the algorithm with impunity.
  – Another edge case is a tie during the assignment step. For example, consider a data-point x_i for which there are two closest clusters at time step t, say 1 and 3. So should z_i^(t) = 1 or z_i^(t) = 3? If z_i^(t−1) = 1, then retain z_i^(t) = 1. That is, if the old assignment is one of the members of the set of closest clusters, do not disturb it. However, if z_i^(t−1) ∉ {1, 3}, then we are at liberty to choose either 1 or 3 for z_i^(t). In this case, choose the smaller of the two, so z_i^(t) = 1. This is done for the sake of consistency and to retain the deterministic character of Lloyd's algorithm for a given initialization.

9. Towards Better Initialization: K-means++

Let us now turn to the question of the effect of initialization on the clusters formed. Here, we choose data-points as the initial means randomly. Consider the following dataset in R (the real line):

[Figure: a one-dimensional dataset; the spacings a and b between groups of points are marked.]

Now, consider the ideal initialization given below along with the resulting clusters:

[Figure: the ideal initialization and the clusters it produces.]

An alternative initialization:

[Figure: an initialization with all three centers close together.]

The resulting clustering is:

[Figure: the sub-optimal clustering produced by this initialization.]
Remark: This is true when a/b is sufficiently large. One can show that a/b>3 produces this clustering.
This is clearly a sub-optimal clustering. The main reason is the poor choice of initial cluster centers: choosing all three of them so close to each other was the problem. K-means++ is a probabilistic algorithm that provides a sounder initialization, so that the probability of such a bad clustering is small. We will describe the algorithm for the case of K = 3.

Step-1: Choose any of the n data-points as the first mean 𝜇_1 by sampling from the data-points uniformly at random. For convenience, we will relabel the sampled mean as x_1. That is, if x_5 is sampled, swap the labels for x_1 and x_5. The probability of this choice is:

P(𝜇_1 = x_1) = 1/n

Step-2: Choose the second mean as one of the remaining data-points using this process:

• Find the distance, d(x_i, 𝜇_1), of each point from 𝜇_1.
• Compute the score for each data-point as the squared distance: s(x_i) = d(x_i, 𝜇_1)²
• Form a probability distribution over the n−1 points using the scores. Recall that the summation starts from 2 since x_1 has been chosen. The PMF of this distribution is given below:

p(i) = s(x_i) / ∑_{j=2}^{n} s(x_j)

• Sample a data-point using this distribution and assign it to 𝜇_2. For convenience, relabel the sampled point x_2. That is, if x_5 is sampled, swap the labels and the scores for x_2 and x_5.

P(𝜇_2 = x_2 | 𝜇_1 = x_1) = s(x_2) / ∑_{j=2}^{n} s(x_j)

Step-3: Choose the third mean as one of the remaining data-points using this process:

• The score for each data-point is the squared distance of x_i to the closest mean: s(x_i) = min( d(x_i, 𝜇_1), d(x_i, 𝜇_2) )²
• Form a probability distribution over the n−2 points. Recall that the summation starts from 3 since x_1 and x_2 have been chosen. The PMF is given by:

p(i) = s(x_i) / ∑_{j=3}^{n} s(x_j)

• Sample a data-point using this distribution and assign it to 𝜇_3. For convenience, relabel the sampled point x_3.

P(𝜇_3 = x_3 | 𝜇_1 = x_1, 𝜇_2 = x_2) = s(x_3) / ∑_{j=3}^{n} s(x_j)

The overall probability of choosing these three means in this order is the product of the three probabilities we have computed so far. That is,
P(𝜇_1 = x_1, 𝜇_2 = x_2, 𝜇_3 = x_3) = P(𝜇_1 = x_1) × P(𝜇_2 = x_2 | 𝜇_1 = x_1) × P(𝜇_3 = x_3 | 𝜇_1 = x_1, 𝜇_2 = x_2)
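The three steps generalize to any K: each new mean is sampled with probability proportional to its score, the squared distance to the closest mean chosen so far. A Python sketch under that reading (the helper name `kmeanspp_init` is my own). Note that an already-chosen point has score 0, so it can never be picked again, which matches sampling only from the remaining points:

```python
import numpy as np

def kmeanspp_init(X, K, rng):
    """Choose K initial means from the rows of X by squared-distance sampling."""
    n = len(X)
    means = [X[rng.integers(n)]]                  # Step-1: uniform choice
    for _ in range(K - 1):
        # score s(x_i): squared distance to the closest mean chosen so far
        d2 = np.min([((X - m) ** 2).sum(axis=1) for m in means], axis=0)
        p = d2 / d2.sum()                         # PMF proportional to the scores
        means.append(X[rng.choice(n, p=p)])       # sample the next mean
    return np.array(means)

rng = np.random.default_rng(0)
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [20.0]])
mus = kmeanspp_init(X, 3, rng)                    # three distinct data-points
```

On a dataset like this one, with three well-separated groups on the line, the squared-distance weighting makes it very likely that one mean lands in each group.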
Once the means are chosen, we can then run K-means as before with this initialization. On average, the resulting clustering will be quite "good" when compared to the optimal clustering. Formally:

E[ ∑_{i=1}^{n} ||x_i − 𝜇_{z_i}||² ] ⩽ O(log K) · min_z ∑_{i=1}^{n} ||x_i − 𝜇_{z_i}||²

To understand the LHS, think about repeating K-means++ a hundred times, calculating the objective function's value each time, and then averaging the results. The claim is that this average value will not be too far from the optimum value. For K > 3, the algorithm can be extended in a similar fashion. A few points to note:

• K-means++ is probabilistic. We won't get the same K initial cluster centers on each run of the algorithm.
• For the same reason, we could end up with a bad initialization. However, the probability of this happening is very small.
• K-means++ spends quite some time computing the initial means, since a lot of distance computations are involved. But once the initial means are there, the actual run of K-means is going to be faster than vanilla K-means because of the good initialization.