1. Estimation: Introduction
2. Maximum Likelihood Estimation
2.1. Bernoulli
2.2. Gaussian
3. Bayesian methods
3.1. Bernoulli with Beta prior
3.2. Point estimate
4. Gaussian Mixture Models
5. GMM and MLE
6. EM algorithm
7. (*) Proof of Correctness
8. References

The starred (*) section may be mathematically heavy. Handle with care.

1. Estimation: Introduction

So far, we have seen two unsupervised learning problems, dimensionality reduction and clustering, and the corresponding algorithms for them, PCA and K-means. In both these situations, we weren't worried about how the data was generated. We did expect the dataset to satisfy certain assumptions and proceeded from there. For example, in vanilla PCA we wanted linearity, and in K-means we wanted the data to lie in Voronoi regions. This week, we start modeling the data generation process. One can also treat this as an introduction to probabilistic generative models.

We assume that there is an underlying probability distribution over the features. The dataset then becomes a collection of points, where each point is sampled from this distribution. For example, consider a dataset with one feature, say {1, -0.3, 5, 0.2, -0.1}. We could assume that this came from some normal distribution with mean 𝜇 and variance 𝜎², where 𝜇 and 𝜎² are parameters of the distribution that are fixed, but unknown to us. Understanding the dataset then turns into a process of estimating the parameters of the distribution given the dataset. In the rest of the document we will look at various methods of parameter estimation and take up specific distributions to understand their applicability.

2. Maximum Likelihood Estimation

The first method is maximum likelihood estimation (MLE).

Likelihood

We assume that the data-points are drawn i.i.d. from the distribution. Recall that i.i.d. stands for independent and identically distributed.
The likelihood of a dataset D under a distribution parameterized by 𝜃 is:

L(𝜃;D) = ∏_{i=1}^{n} P(x_i;𝜃)

A few points concerning the likelihood:
• The likelihood is the "likelihood" of seeing the data as the result of drawing samples from the underlying distribution. That is, it measures how likely it is to observe this collection of data-points when drawing from a distribution with parameter value 𝜃.
• It takes this particular form because the points are assumed to be sampled independently and identically from the distribution. Independence allows us to multiply the individual probabilities/densities; the identical nature allows us to use the same PMF/PDF for every point.
• The likelihood is a function of the parameter 𝜃. It should not be confused with a probability distribution. The dataset D remains fixed; as 𝜃 changes, we get different values for the likelihood.
• P could be a PMF or a PDF depending on whether x_i is discrete or continuous.

Log-likelihood

l(𝜃;D) = ∑_{i=1}^{n} log P(x_i;𝜃)

Since a product of probabilities would result in a very small number, we move to log-space to avoid underflow.

Maximizing the log-likelihood

We now look for the parameter value that maximizes the likelihood:
max_𝜃 L(𝜃;D)

Since log is a strictly increasing function, we can maximize the log-likelihood instead:

max_𝜃 l(𝜃;D)

We will take up two examples. In both cases, the dataset has n data-points.

2.1. Bernoulli

Consider a dataset where each data-point x_i ∈ {0,1}. We can think of this as tossing a coin, where 1 corresponds to heads and 0 to tails. Now, let X be a random variable with support {0,1} that follows the Bernoulli distribution with parameter p, where p = P(X=1). The PMF of X is given by:

p_X(x) = p if x = 1, and 1-p if x = 0
This can be compactly expressed as:

p_X(x) = p^x (1-p)^{1-x}

Note that p_X(1) = p and p_X(0) = 1-p, as expected.

The likelihood of the dataset as a function of p is:

L(p;D) = ∏_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{∑_i x_i} (1-p)^{∑_i (1-x_i)} = p^{n_h}(1-p)^{n_t}

where n_h is the number of heads and n_t the number of tails. Note that n_h + n_t = n. The log-likelihood now becomes:

l(p;D) = n_h log p + n_t log(1-p)

To get the MLE, we differentiate the log-likelihood with respect to p and set it to zero:

dl/dp = n_h/p - n_t/(1-p) = 0 ⟹ p̂ = n_h/n

We usually denote the MLE of a parameter by adding a hat (ˆ) above the parameter. Therefore, the maximum likelihood estimate is:

p̂ = n_h/n

This is just the mean of the data-points.
Estimator vs Estimate: If X_i is the random variable corresponding to the i-th sample, then p̂ = (1/n) ∑_{i=1}^{n} X_i is an unbiased estimator for the parameter p. An estimator is a function of the n random variables and is itself a random variable. When we have the realizations x_1,⋯,x_n, the actual dataset, then p̂ = (1/n) ∑_{i=1}^{n} x_i is an estimate. It is a realization of the estimator and is a deterministic value. We will stick to estimates in this document.
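As a sanity check, the estimate is a one-liner to compute; a minimal sketch (the tosses below are made up for illustration):

```python
# Bernoulli MLE: p_hat = n_h / n (hypothetical coin tosses, 1 = heads, 0 = tails)
data = [1, 0, 1, 1, 0, 1, 0, 1]

n_h = sum(data)   # number of heads
n = len(data)     # total number of tosses
p_hat = n_h / n   # the MLE, which is just the sample mean

print(p_hat)  # 0.625
```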
2.2. Gaussian

Now we turn to the Gaussian distribution. Let x_i ∈ ℝ. We can model this using a continuous random variable that follows a Gaussian distribution with parameters 𝜇 and 𝜎². That is, X ∼ N(𝜇,𝜎²). The PDF is:

f_X(x) = (1/√(2𝜋𝜎²)) exp[-(1/2)((x-𝜇)/𝜎)²]

Once again assuming i.i.d. data, the likelihood is expressed as:

L(𝜇,𝜎²;D) = ∏_{i=1}^{n} f_X(x_i)

Note that L is now a function of two parameters, 𝜇 and 𝜎². Turning to the log-likelihood as before:

l(𝜇,𝜎²;D) = ∑_{i=1}^{n} log f_X(x_i) = ∑_{i=1}^{n} [-(1/2) log(2𝜋𝜎²) - (x_i-𝜇)²/(2𝜎²)]

Since this is a multivariable function, we need to work with partial derivatives:

∂l/∂𝜇 = (1/𝜎²) ∑_{i=1}^{n} (x_i-𝜇) = 0 ⟹ 𝜇̂ = (1/n) ∑_{i=1}^{n} x_i

We see that the MLE for 𝜇 is the sample mean. Next, we go for 𝜎:

∂l/∂𝜎 = ∑_{i=1}^{n} [-1/𝜎 + (x_i-𝜇)²/𝜎³] = 0 ⟹ 𝜎̂² = (1/n) ∑_{i=1}^{n} (x_i-𝜇)²

Plugging in the value of 𝜇̂, we get:

𝜎̂² = (1/n) ∑_{i=1}^{n} (x_i-𝜇̂)²

This is nothing but the sample variance. Putting the two together, the MLE estimates are:

𝜇̂ = (1/n) ∑_{i=1}^{n} x_i,  𝜎̂² = (1/n) ∑_{i=1}^{n} (x_i-𝜇̂)²
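Both estimates are easy to compute directly; a minimal sketch using the five-point dataset from the introduction:

```python
# Gaussian MLE on the small dataset from the introduction
data = [1.0, -0.3, 5.0, 0.2, -0.1]

n = len(data)
mu_hat = sum(data) / n                                   # sample mean
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n    # sample variance (divide by n, not n-1)

print(mu_hat, sigma2_hat)  # 1.16 and roughly 3.88
```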
3. Bayesian methods

In a Bayesian setting, probabilities are viewed as beliefs. The probability P(A) = 0.3 indicates that we are about 30% confident that event A will occur. Contrast this with the frequentist interpretation, in which we repeatedly run an experiment and claim that event A occurs 30% of the time. Both are valid approaches to interpreting probabilities. In this section, we will look at the Bayesian framework.

Bayes Theorem

Bayes' theorem is a tool that allows us to update our belief about a situation using data. One way of presenting it is:

Posterior = (Prior × Likelihood) / Evidence

• The prior encodes our belief about the situation before observing the data (evidence).
• The likelihood tells us how well the data conforms to our prior belief.
• The likelihood is multiplied with the prior and normalized by the evidence to give the posterior, the updated belief. The evidence is a normalizing factor here.

Since the evidence is a constant, we can also express the relationship as:

Posterior ∝ Prior × Likelihood

The prior belief, when modified by the likelihood, results in the posterior. In the context of parameter estimation, Bayes' theorem takes this form:

P(𝜃 | D) = P(𝜃) · P(D | 𝜃) / P(D)

We have a certain prior belief about the parameter 𝜃 of the underlying distribution describing the data. The likelihood of observing the data given this parameter then modifies the prior belief multiplicatively to give us the updated belief about 𝜃 given the data, which we call the posterior.

A note on the types of objects in Bayes' theorem:
• The prior and posterior are probability distributions over the parameter 𝜃.
• The likelihood is a function of the parameter 𝜃 for the given dataset.
• The evidence is a scalar value that acts as a normalizing constant.

We will look at an example of Bayesian estimation for a binary dataset in {0,1}^n, modeled using a Bernoulli distribution with parameter p and a Beta prior for p.

3.1. Bernoulli with Beta prior

Consider a dataset D in which each x_i ∈ {0,1}. As before, we will model this using a Bernoulli distribution. Let X ∼ Bernoulli(p), where p is the parameter of the distribution. In MLE, we estimated p by maximizing the likelihood. In a Bayesian setting, we start with a probability distribution over p that encodes our initial belief. The Beta distribution is a suitable choice since its support is [0,1], which coincides with the range of values p can assume.

Beta Prior

Beta(𝛼,𝛽) = (1/𝛣(𝛼,𝛽)) p^{𝛼-1}(1-p)^{𝛽-1}

Here:
• 𝛼,𝛽 > 0 are parameters of the distribution.
• The support is [0,1].
• 𝛣(𝛼,𝛽), called the Beta function, is a normalizing constant that ensures that Beta(𝛼,𝛽) is a valid PDF.

Let us take a quick look at some of the shapes that the Beta distribution can assume as we vary the parameters 𝛼,𝛽. Each one can model a different belief.

[Figure: density plots of Beta(2,5), Beta(5,2), Beta(3,3), Beta(0.5,3), Beta(3,0.5) and Beta(0.5,0.5)]

To link this with a practical experiment, we can think of a biased coin whose probability of landing heads is p. We will compare some of these shapes and connect them with the corresponding prior beliefs about the nature of the coin:
Belief → Prior
• We believe that the coin is more or less unbiased. → Beta(3,3)
• We believe that the coin is considerably biased towards heads. → Beta(5,2)
• We believe that the coin is heavily biased, but have no clue about the direction of the bias. → Beta(0.5,0.5)
Now we get back to the problem. The Bernoulli likelihood should be quite familiar by now. If we have n_h heads and n_t tails, with n_h + n_t = n, then the likelihood is:

p^{n_h}(1-p)^{n_t}

Which brings us to the posterior.

Posterior

We have:

Posterior ∝ Prior × Likelihood

Plugging in the prior and likelihood:

Posterior ∝ p^{𝛼-1}(1-p)^{𝛽-1} · p^{n_h}(1-p)^{n_t} = p^{(𝛼+n_h)-1} · (1-p)^{(𝛽+n_t)-1}

The quantity on the RHS has the form of the Beta distribution! It is just a scaled version of the Beta distribution with parameters 𝛼+n_h and 𝛽+n_t. It will become a distribution once we normalize it. Fortunately, we needn't worry about the normalization factor here, and we can directly express the posterior as:

Posterior = Beta(𝛼+n_h, 𝛽+n_t)
The Beta distribution is said to be a conjugate prior for the Bernoulli likelihood. A conjugate prior has a form similar to the likelihood, thus allowing us to simplify the computation of the posterior. If we didn't have this conjugate prior-likelihood pair, then we would have to compute the normalizing constant by explicitly integrating the product of prior and likelihood over the support of the parameter, a task that might prove to be difficult, if not impossible.

There is an interesting interpretation of this update:

Beta(𝛼,𝛽) → Beta(𝛼+n_h, 𝛽+n_t)

• We can view 𝛼 and 𝛽 as pseudo-observations. It is as though we had already observed 𝛼 heads and 𝛽 tails before going on to observe the dataset D with n_h heads and n_t tails.
• We have to be careful not to take this interpretation too literally, since 𝛼,𝛽 are not necessarily integers.

3.2. Point estimate

Often we want a point estimate (a single number) for the parameter. The MLE was a point estimate. However, Bayesian methods return a distribution (the posterior) over the parameter. We look at two ways to extract a point estimate:
• the expectation of the posterior
• the mode of the posterior

For this Beta prior-Bernoulli likelihood pair, the expected value of the posterior is:

(𝛼+n_h) / (𝛼+𝛽+n)

This has a clean interpretation as the proportion of heads after including the pseudo-observations, with the caveat that 𝛼,𝛽 are not necessarily integers.

The mode of the posterior, for 𝛼+n_h > 1 and 𝛽+n_t > 1, is:

(𝛼+n_h-1) / (𝛼+𝛽+n-2)

In this context, the mode of the posterior is also called the Maximum A Posteriori estimate, or MAP estimate, since the mode is nothing but the (arg)maximum of the posterior. In a general setting:

𝜃_MAP = argmax_𝜃 f(𝜃 | D)

Remark: In general, for a Beta distribution with parameters 𝛼,𝛽, the expectation is 𝛼/(𝛼+𝛽). The mode is a bit tricky and has to be defined in a piecewise manner depending on the ranges of 𝛼 and 𝛽. When 𝛼 > 1 and 𝛽 > 1, the mode is (𝛼-1)/(𝛼+𝛽-2).
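The whole Beta-Bernoulli update fits in a few lines of code; a minimal sketch (the prior and the tosses below are made up for illustration):

```python
# Conjugate Beta-Bernoulli update (illustrative prior and data)
alpha, beta = 3.0, 3.0           # prior Beta(3,3): we believe the coin is roughly unbiased
data = [1, 1, 0, 1, 1, 0, 1, 1]  # hypothetical tosses, 1 = heads

n_h = sum(data)
n_t = len(data) - n_h

# Posterior is Beta(alpha + n_h, beta + n_t) -- no integration needed
alpha_post = alpha + n_h
beta_post = beta + n_t

posterior_mean = alpha_post / (alpha_post + beta_post)
map_estimate = (alpha_post - 1) / (alpha_post + beta_post - 2)  # valid: both params > 1

print(alpha_post, beta_post)            # 9.0 5.0
print(posterior_mean, map_estimate)
```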
4. Gaussian Mixture Models

For more complex datasets, we turn to what is called a Gaussian Mixture Model. A GMM is a probability distribution. It is a mixture of K Gaussians, each of which is called a component. Once again we let X denote the random variable and x_i ∈ ℝ, i = 1,⋯,n, the data-points that are sampled i.i.d. from the distribution governing X. If f_X(·) denotes the density of the GMM, we have:

f_X(x) = ∑_{k=1}^{K} 𝜋_k · N(x;𝜇_k,𝜎²_k)

where ∑_{k=1}^{K} 𝜋_k = 1 so that f_X is a valid PDF. Here:
• f_X is the PDF of the GMM.
• N is the PDF of a Gaussian distribution with mean 𝜇_k and variance 𝜎²_k.
• 𝜋_k is the "prior" contribution of the k-th component. Collectively, the 𝜋_k's are called the mixture probabilities.

A GMM is a latent variable model. That is, we can view the data-generation process by introducing a latent (hidden) variable z_i for each data-point. Generating x_i can then be explained as follows:
• First sample a component k by setting z_i = k with prior probability 𝜋_k.
• Then sample a point from the corresponding Gaussian; the conditional density associated with this is N(x;𝜇_k,𝜎²_k).

To be a little more rigorous, we now introduce the random variable Z to denote the component indicator. Z has support {1,⋯,K} with the following PMF:

p_Z(k) = 𝜋_k

Recall that we use capital letters for random variables and small letters for realizations. The joint density of generating the point x from component z becomes:

f_{X,Z}(x,z) = p_Z(z) · f_{X|Z}(x | z)

For a specific realization x_i and k, we have:

f_{X,Z}(x_i,k) = 𝜋_k N(x_i;𝜇_k,𝜎²_k)

Next, to get the density of the GMM, we marginalize over the random variable Z:

f_X(x) = ∑_{k=1}^{K} f_{X,Z}(x,k) = ∑_{k=1}^{K} p_Z(k) · f_{X|Z}(x | k)

This leads us to:

f_X(x) = ∑_{k=1}^{K} 𝜋_k · N(x;𝜇_k,𝜎²_k)

This is the density of the GMM as obtained earlier, but seen through the lens of latent variables. Note that the latent variables are not explicitly observed; only the dataset D is observed. However, the introduction of the latent variables will prove beneficial, as we will soon see.

5. GMM and MLE

Once we have a distribution, we can follow the established recipe of estimating its parameters using MLE. For a GMM with K components, we need to estimate 3K-1 free parameters:
• K-1 mixture probabilities
• K means
• K variances

Since ∑_{k=1}^{K} 𝜋_k = 1, we only have K-1 free variables for the mixture probabilities. The parameters are collectively referred to as 𝜃 = [𝜋,𝜇,𝜎]. The log-likelihood of the GMM can now be written as:
l(𝜃;D) = ∑_{i=1}^{n} log f_X(x_i) = ∑_{i=1}^{n} log ∑_{k=1}^{K} 𝜋_k · N(x_i;𝜇_k,𝜎²_k)
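The density inside the log is simple to compute, and the two-step generative story is simple to simulate; a minimal 1-D sketch (all function names and parameter values here are illustrative, not part of the notes):

```python
import math
import random

def gauss_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def gmm_pdf(x, pis, mus, sigma2s):
    """GMM density: sum_k pi_k * N(x; mu_k, sigma2_k)."""
    return sum(p * gauss_pdf(x, m, s2) for p, m, s2 in zip(pis, mus, sigma2s))

def gmm_sample(pis, mus, sigma2s):
    """Two-step generative process: pick component z with prob pi_z, then sample its Gaussian."""
    z = random.choices(range(len(pis)), weights=pis)[0]
    return random.gauss(mus[z], math.sqrt(sigma2s[z]))

# Illustrative 2-component mixture
pis, mus, sigma2s = [0.3, 0.7], [-2.0, 3.0], [1.0, 0.5]
x = gmm_sample(pis, mus, sigma2s)     # one draw from the mixture
print(gmm_pdf(0.0, pis, mus, sigma2s))  # density of the mixture at x = 0
```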
The difficulty here is that it is hard to optimize l(𝜃;D) directly. We will not get a closed-form expression for the parameters. One of the problems is the sum inside the log. One way to convert a log of sums into a sum of logs is provided by Jensen's inequality.

Jensen's Inequality: For 0 ⩽ 𝛼_1,𝛼_2 ⩽ 1 with 𝛼_1+𝛼_2 = 1, and for all x_1,x_2 > 0, we have:

log(𝛼_1 x_1 + 𝛼_2 x_2) ⩾ 𝛼_1 log x_1 + 𝛼_2 log x_2

To understand this better, consider:

log((1/2)·2 + (1/2)·5) ⩾ (1/2)·log 2 + (1/2)·log 5

[Figure: the chord joining (2, log 2) and (5, log 5) lies below the graph of log; its midpoint (1/2)(log 2 + log 5) sits below log((2+5)/2)]

log is a concave function and is dome shaped. To remember the inequality, use the mnemonic that the "dome is always above the lines across its surface".

In general, for 0 ⩽ 𝛼_1,⋯,𝛼_n ⩽ 1 with 𝛼_1+⋯+𝛼_n = 1, and for all x_1,⋯,x_n > 0, we have:

log(𝛼_1 x_1 + ⋯ + 𝛼_n x_n) ⩾ 𝛼_1 log x_1 + ⋯ + 𝛼_n log x_n

This can also be viewed in terms of the expectation of a random variable whose support is {x_1,⋯,x_n} with PMF p_X(x_i) = 𝛼_i. In this form, Jensen's inequality becomes:

log(E[X]) ⩾ E[log X]

Equality occurs when X is a constant.
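The inequality is easy to verify numerically; a quick check of the worked example from the text:

```python
import math

# Jensen's inequality for log: log(1/2 * 2 + 1/2 * 5) >= 1/2 * log 2 + 1/2 * log 5
lhs = math.log(0.5 * 2 + 0.5 * 5)            # log of the average
rhs = 0.5 * math.log(2) + 0.5 * math.log(5)  # average of the logs

print(lhs, rhs)
assert lhs >= rhs  # the "dome" (concave log) lies above its chords
```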
Back to the log-likelihood:

l(𝜃;D) = ∑_{i=1}^{n} log ∑_{k=1}^{K} 𝜋_k · N(x_i;𝜇_k,𝜎²_k)

Since ∑_{k=1}^{K} 𝜋_k = 1, we could apply Jensen's inequality right away. However, that won't take us very far. It turns out that it helps to introduce certain intermediate variables, denoted 𝜆_ik, into the equation. Specifically:
• i corresponds to the index of the data-point.
• k corresponds to the index of the component.
• 0 ⩽ 𝜆_ik ⩽ 1, with ∑_{k=1}^{K} 𝜆_ik = 1 for all i.
• There are Kn such variables, out of which n(K-1) are free.

Multiplying and dividing each term inside the log by the appropriate 𝜆_ik, we have:

l(𝜃;D) = ∑_{i=1}^{n} log ∑_{k=1}^{K} 𝜆_ik · [𝜋_k · N(x_i;𝜇_k,𝜎²_k) / 𝜆_ik]

Now we can apply Jensen's inequality to get:

l(𝜃;D) ⩾ ∑_{i=1}^{n} ∑_{k=1}^{K} 𝜆_ik · log[𝜋_k · N(x_i;𝜇_k,𝜎²_k) / 𝜆_ik]

Denoting the new quantity on the right hand side as l_B(𝜃,𝜆;D), we have:

l(𝜃;D) ⩾ l_B(𝜃,𝜆;D)

l_B(𝜃,𝜆;D) is a lower bound for every choice of 𝜃 and 𝜆, as long as 𝜆 satisfies ∑_{k=1}^{K} 𝜆_ik = 1 for each i. This is often called the Evidence Lower Bound (ELBO). It seems as though Jensen has only complicated things, but we will see how this simple algebraic trick pays off in the next section.

6. EM algorithm

If we can't maximize the log-likelihood explicitly, we can at least try to maximize its lower bound. If that proves to be a tractable approach, then we have a reasonable compromise. Expectation-Maximization is an iterative algorithm that takes this approach towards parameter estimation for latent variable models. We will stick to the EM algorithm applied to GMMs in this document.

As stated earlier, the parameters are collectively referred to as 𝜃 = [𝜋,𝜇,𝜎] and the intermediate variables are collectively referred to as 𝜆. We revisit the lower bound:

l(𝜃;D) ⩾ l_B(𝜃,𝜆;D)

where:

l_B(𝜃,𝜆;D) = ∑_{i=1}^{n} ∑_{k=1}^{K} 𝜆_ik · log[𝜋_k · N(x_i;𝜇_k,𝜎²_k) / 𝜆_ik]

We start with an initial guess for 𝜃, denoted 𝜃(0), and keep improving our estimate in each iteration of the algorithm. An iteration consists of two steps:
• Expectation step (E-step): estimate the values for 𝜆 using the current values 𝜃(t-1).
• Maximization step (M-step): update the values 𝜃(t) using the newly found values of 𝜆.

The sequence of steps can be visualized as follows:

𝜃(0) → 𝜆 → 𝜃(1) [iteration 1] → 𝜆 → 𝜃(2) [iteration 2] → ⋯ → 𝜃(T-1) → 𝜆 → 𝜃(T) [iteration T]

The value of 𝜆 computed in the E-step doesn't get carried over from one iteration to the next; hence, we have dropped the iteration counter superscript on it. As for convergence, the log-likelihood can be shown to be monotonically non-decreasing from one iteration to the next, that is:

l(𝜃(t+1);D) ⩾ l(𝜃(t);D)

When the change between successive iterates becomes smaller than some 𝜖, we can terminate the algorithm. Therefore, one termination criterion could be:

||𝜃(t+1) - 𝜃(t)|| < 𝜖

We now specify the pseudocode first and then expand on each of the steps:
EM(X, 𝜖)
    𝜃(0) ← initializeParams(X)
    do
        𝜆 ← eStep(X, 𝜃(t))
        𝜃(t+1) ← mStep(X, 𝜆)
    while ||𝜃(t+1) - 𝜃(t)|| ⩾ 𝜖
    return 𝜃(t+1)
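The pseudocode above can be rendered as a short program. Below is a 1-D sketch that uses the E- and M-step formulas derived in the following subsections; the quantile-based initialization is a simplification (the notes recommend K-means initialization), and all function names are illustrative:

```python
import math
import random

def gauss_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def e_step(X, pis, mus, sigma2s):
    """lambda_ik: posterior probability of component k given x_i (Bayes' theorem)."""
    lam = []
    for x in X:
        joint = [p * gauss_pdf(x, m, s2) for p, m, s2 in zip(pis, mus, sigma2s)]
        total = sum(joint)
        lam.append([j / total for j in joint])
    return lam

def m_step(X, lam, K):
    """Weighted updates for pi_k, mu_k and sigma2_k."""
    n = len(X)
    pis, mus, sigma2s = [], [], []
    for k in range(K):
        w = [row[k] for row in lam]
        wsum = sum(w)
        mu = sum(wi * xi for wi, xi in zip(w, X)) / wsum
        s2 = sum(wi * (xi - mu) ** 2 for wi, xi in zip(w, X)) / wsum
        pis.append(wsum / n)
        mus.append(mu)
        sigma2s.append(max(s2, 1e-6))  # guard against a collapsing variance
    return pis, mus, sigma2s

def em_gmm(X, K, iters=50):
    # Quantile-based initialization; K-means (as the notes suggest) would be a
    # better, though longer, choice.
    Xs = sorted(X)
    mus = [Xs[(2 * k + 1) * len(Xs) // (2 * K)] for k in range(K)]
    pis = [1.0 / K] * K
    sigma2s = [1.0] * K
    for _ in range(iters):
        lam = e_step(X, pis, mus, sigma2s)     # E-step
        pis, mus, sigma2s = m_step(X, lam, K)  # M-step
    return pis, mus, sigma2s

# Illustrative data: two well-separated blobs
random.seed(0)
X = [random.gauss(-3, 0.5) for _ in range(100)] + [random.gauss(4, 0.8) for _ in range(100)]
pis, mus, sigma2s = em_gmm(X, K=2)
print(sorted(mus))  # the two means should land near -3 and 4
```

The termination test `||𝜃(t+1) - 𝜃(t)|| ⩾ 𝜖` from the pseudocode is replaced here by a fixed iteration count to keep the sketch short.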
Initialization

A common practice is to use K-means to initialize 𝜃(0). We run K-means with K equal to the number of components in the GMM, that is, as many clusters as there are components in the mixture model. Once that is done, let z_i denote the indicator of the final cluster of point i. Then, we can initialize 𝜋(0), 𝜇(0) and 𝜎(0) as follows:

𝜋(0)_k = (1/n) ∑_{i=1}^{n} I[z_i = k]

𝜇(0)_k = ∑_{i=1}^{n} I[z_i = k]·x_i / ∑_{i=1}^{n} I[z_i = k]

𝜎²(0)_k = ∑_{i=1}^{n} I[z_i = k]·(x_i - 𝜇(0)_k)² / ∑_{i=1}^{n} I[z_i = k]

Note that 𝜇(0)_k is just the mean of the k-th cluster, 𝜋(0)_k is the fraction of the data-points in that cluster, and 𝜎²(0)_k is its variance. This acts as a very natural initialization for EM.

E-step

In the E-step, we use the current parameter values to estimate 𝜆_ik:
𝜆_ik = P(Z_i = k | X = x_i) = p_{Z|X}(k | x_i)

𝜆_ik is the posterior probability of component k given x_i. That is, given that x_i is observed, it is the probability that component k generated it. Alternatively, 𝜆_ik is the contribution of the k-th component to the point x_i, given that we have observed the point x_i.

Using Bayes' theorem, we can express this as:

𝜆_ik = p_Z(k)·f_{X|Z}(x_i | k) / f_X(x_i)
     = p_Z(k)·f_{X|Z}(x_i | k) / ∑_{l=1}^{K} p_Z(l)·f_{X|Z}(x_i | l)
     = 𝜋_k·N(x_i;𝜇_k,𝜎²_k) / ∑_{l=1}^{K} 𝜋_l·N(x_i;𝜇_l,𝜎²_l)

Note that 𝜋_k, 𝜇_k, 𝜎²_k are the values at time-step t; the superscript has been omitted for simplicity.

M-step

Next, we use the values of 𝜆_ik obtained in the E-step to update the values of 𝜋_k, 𝜇_k, 𝜎²_k. The updates take the following form:

𝜇_k = ∑_{i=1}^{n} 𝜆_ik·x_i / ∑_{i=1}^{n} 𝜆_ik,  𝜎²_k = ∑_{i=1}^{n} 𝜆_ik·(x_i-𝜇_k)² / ∑_{i=1}^{n} 𝜆_ik,  𝜋_k = (1/n) ∑_{i=1}^{n} 𝜆_ik

This has quite an intuitive interpretation. The mean 𝜇_k is a weighted average of all the data-points, where the weight is 𝜆_ik. This weight indicates the contribution of the data-point x_i towards the formation of the mean of component k. A similar interpretation holds for the variance of each component.

Similarity with K-Means

The structure of the overall algorithm becomes clearer when we compare it with K-means. 𝜆_ik can be seen as the affinity of x_i to component k. In K-means this affinity is binary: a point either belongs to a cluster or it doesn't, a hard assignment. In the case of EM, this affinity is a number in [0,1], a soft assignment. The E-step is analogous to the cluster reassignment step in K-means. The M-step is analogous to the updates for the cluster centers in K-means. We display this correspondence in the table below:
EM Algorithm ↔ K-Means
• E-step: 𝜆_ik = p_{Z|X}(k | x_i) ↔ Reassignment step: 𝜆_ik = 1 if x_i is closest to 𝜇_k, and 0 otherwise.
• M-step: 𝜇_k = ∑_i 𝜆_ik·x_i / ∑_i 𝜆_ik, 𝜎²_k = ∑_i 𝜆_ik·(x_i-𝜇_k)² / ∑_i 𝜆_ik, 𝜋_k = ∑_i 𝜆_ik / n ↔ Update means: 𝜇_k = ∑_i 𝜆_ik·x_i / ∑_i 𝜆_ik.
For this reason, EM can be viewed as doing soft clustering instead of hard clustering. The 𝜆_ik values contain this information.

7. (*) Proof of Correctness

We now look at the algorithm in a little more detail to understand how we got here. The lower bound of the log-likelihood obtained using Jensen's inequality is the key result. Recall that:

l(𝜃;D) ⩾ l_B(𝜃,𝜆;D)

where:

l_B(𝜃,𝜆;D) = ∑_{i=1}^{n} ∑_{k=1}^{K} 𝜆_ik · log[𝜋_k·N(x_i;𝜇_k,𝜎²_k) / 𝜆_ik]

As stated before, this inequality holds for all 𝜃 and all 𝜆 satisfying ∑_{k=1}^{K} 𝜆_ik = 1. The idea is to maximize the lower bound. For a given 𝜃(t), a natural tendency is to look for the 𝜆 that pushes l_B(𝜃(t),𝜆) to its maximum value. Since l_B lower-bounds l, the best-case scenario would be to find a 𝜆 with:

l_B(𝜃(t),𝜆;D) = l(𝜃(t);D)

Can this ever happen? Indeed it does, when 𝜆_ik = p_{Z|X}(k | x_i); this is precisely our E-step! The proof is quite straightforward. Assume that 𝜃(t) = [𝜋,𝜇,𝜎].
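The computation, which the source leaves out, can be sketched as follows: with the stated choice of 𝜆, the ratio inside the log collapses to f_X(x_i).

```latex
\lambda_{ik} = p_{Z|X}(k \mid x_i)
             = \frac{\pi_k\,\mathcal{N}(x_i;\mu_k,\sigma_k^2)}{f_X(x_i)}
\quad\Longrightarrow\quad
\frac{\pi_k\,\mathcal{N}(x_i;\mu_k,\sigma_k^2)}{\lambda_{ik}} = f_X(x_i)
```

```latex
l_B(\theta^{(t)},\lambda;D)
= \sum_{i=1}^{n}\sum_{k=1}^{K}\lambda_{ik}\log f_X(x_i)
= \sum_{i=1}^{n}\log f_X(x_i)
\;\Bigl(\text{since } \textstyle\sum_{k=1}^{K}\lambda_{ik}=1\Bigr)
= l(\theta^{(t)};D)
```

So for this particular 𝜆, the lower bound coincides with the log-likelihood at 𝜃(t).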
Visually, here is what we are doing at time step t (figure adapted from Bishop). The red curve is the log-likelihood; the blue curve is the lower bound that we get as a result of the E-step.

[Figure: E-step — the lower bound l_B(𝜃,𝜆;D) touches the log-likelihood l(𝜃;D) at 𝜃(t), so that l_B(𝜃(t),𝜆;D) = l(𝜃(t);D)]

Once the E-step is done, we turn to the M-step, where we maximize the lower bound with respect to 𝜃. The updated value 𝜃(t+1) lies at the peak of the lower bound. This is a conventional optimization problem: we take partial derivatives of l_B(𝜃,𝜆;D) with respect to each parameter, set them to zero, and arrive at the resulting estimates. Visually, this is the result of the M-step:

[Figure: M-step — 𝜃(t+1) maximizes the lower bound l_B(𝜃,𝜆;D), and l(𝜃(t+1);D) ⩾ l(𝜃(t);D)]

We now keep alternating between these two steps. Finally, we come to the question of convergence. We have to show the following:

l(𝜃(t+1);D) ⩾ l(𝜃(t);D)

This is nicely captured in the chain of inequalities/equalities given below, where 𝜆 is the value obtained in the E-step of iteration t:

l(𝜃(t);D) = l_B(𝜃(t),𝜆;D) ⩽ l_B(𝜃(t+1),𝜆;D) ⩽ l(𝜃(t+1);D)

The chain has to be read from left to right:
• l(𝜃(t);D) = l_B(𝜃(t),𝜆;D) holds because of the E-step. The choice of 𝜆 ensures that the lower bound is tight at 𝜃(t).
• l_B(𝜃(t),𝜆;D) ⩽ l_B(𝜃(t+1),𝜆;D) holds because of the M-step. Once 𝜆 is fixed, 𝜃(t+1) is the value that maximizes the lower bound.
• l_B(𝜃(t+1),𝜆;D) ⩽ l(𝜃(t+1);D) holds because l_B is a lower bound for the log-likelihood everywhere.

8. References

Outside of our lectures, the following references were used:
• Pattern Recognition and Machine Learning, Christopher M. Bishop.
• CS229 Lecture Notes, Stanford, Andrew Ng, updated by Tengyu Ma.