Week-4
1. Estimation: MLE
   1.1. Example: Bernoulli
   1.2. Gaussian
2. Estimation: Bayesian methods
   2.1. Bernoulli with Beta prior
   2.2. Point estimate
3. Gaussian Mixture Models
4. EM algorithm

1. Estimation: MLE

Likelihood
The likelihood of a dataset D under a distribution parameterized by $\theta$ is

$$L(\theta; D) = \prod_{i=1}^{n} P(x_i; \theta)$$

The likelihood is the "likelihood" of seeing the data if it is the result of drawing samples from the underlying distribution. It takes this particular product form because the points are assumed to be sampled independently and identically from the distribution. The likelihood is a function of the parameter $\theta$; it should not be confused with a probability distribution. $P$ could be a PDF or a PMF depending on whether $x_i$ is continuous or discrete.

Log-likelihood

$$l(\theta; D) = \sum_{i=1}^{n} \log P(x_i; \theta)$$

Since a product of probabilities results in a very small number, we move to log-space to avoid numerical underflow.

Maximizing the likelihood
Estimate the parameter value that maximizes the likelihood:

$$\max_{\theta} \; L(\theta; D)$$

Maximizing the log-likelihood
Since $\log$ is a strictly increasing function, we can equivalently maximize the log-likelihood:

$$\max_{\theta} \; l(\theta; D)$$
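To see why log-space matters in practice, here is a small NumPy sketch (the 2000 probabilities of 0.5 are made up purely for illustration): the raw product underflows to zero, while the log-likelihood remains perfectly representable.

```python
import numpy as np

# 2000 hypothetical data points, each assigned probability 0.5 by the model
probs = np.full(2000, 0.5)

print(np.prod(probs))         # 0.0 -- the raw product underflows
print(np.sum(np.log(probs)))  # about -1386.29 -- the log-likelihood is fine
```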
1.1. Example: Bernoulli

Support: $\{0, 1\}$. $X = 1$ is equivalent to heads and $X = 0$ is equivalent to tails.

PMF

$$P(X = x) = p^x (1-p)^{1-x}$$

A compact representation of $P(X = 1) = p$ and $P(X = 0) = 1 - p$.

Likelihood

$$L(p; D) = p^{\sum_{i=1}^{n} x_i} (1-p)^{\sum_{i=1}^{n} (1 - x_i)}$$

which simplifies to

$$L(p; D) = p^{n_h} (1-p)^{n_t}$$

where $n_h$ is the number of heads and $n_t$ is the number of tails. Note that $n_h + n_t = n$.

Log-likelihood

$$l(p; D) = n_h \log p + n_t \log (1-p)$$

MLE for p

$$\hat{p} = \frac{n_h}{n}$$
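A minimal sketch of the Bernoulli MLE, assuming a simulated coin-flip dataset (the true p = 0.7 is made up); the closed-form estimate $n_h / n$ is cross-checked against a brute-force grid search over the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=1000)    # simulated coin flips, true p = 0.7

n_h, n = x.sum(), len(x)
p_hat = n_h / n                        # closed-form MLE
print(p_hat)

# Sanity check: maximize the log-likelihood over a grid of candidate values
grid = np.linspace(0.01, 0.99, 999)
loglik = n_h * np.log(grid) + (n - n_h) * np.log(1 - grid)
print(grid[np.argmax(loglik)])         # agrees with n_h / n up to the grid resolution
```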
1.2. Gaussian

Support: $\mathbb{R}$.

PDF

$$f(x; \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]$$

Likelihood

$$L(\mu, \sigma^2; D) = \prod_{i=1}^{n} f(x_i; \mu, \sigma^2)$$

Log-likelihood

$$l(\mu, \sigma^2; D) = \sum_{i=1}^{n} \log f(x_i; \mu, \sigma^2)$$

MLE for $\mu$

$$\hat{\mu} = \frac{\sum_{i=1}^{n} x_i}{n}$$
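A similar sketch for the Gaussian case, again on simulated data (the true $\mu = 5$ and $\sigma = 2$ are made up); the MLE for $\mu$ is simply the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)   # simulated data, true mu = 5, sigma = 2

mu_hat = x.sum() / len(x)                       # MLE for the mean: the sample mean
print(mu_hat)                                   # close to 5.0

var_hat = np.mean((x - mu_hat) ** 2)            # MLE for the variance (not derived above)
print(var_hat)                                  # close to 4.0
```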
2. Estimation: Bayesian methods

In a Bayesian setting, probabilities are viewed as beliefs.

Bayes theorem
Bayes theorem is a tool that allows you to update your belief about a situation using data:

$$\text{Posterior} = \frac{\text{Prior} \times \text{Likelihood}}{\text{Evidence}}$$

- The prior encodes your belief about the situation before observing the data (the evidence).
- The likelihood measures how probable the observed data is under each candidate value of the unknown quantity.
- The likelihood is multiplied with the prior and normalized by the evidence to give the posterior, the updated belief.
- The evidence acts purely as a normalizing factor here.

In the context of parameter estimation, Bayes theorem takes this form:

$$P(\theta \mid D) = \frac{P(\theta) \, P(D \mid \theta)}{P(D)}$$
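A minimal sketch of this update for estimating a coin's bias, assuming a discrete grid of candidate $\theta$ values and a made-up sequence of flips; the evidence is just the normalizing sum.

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 99)           # candidate parameter values
prior = np.full_like(theta, 1 / len(theta))   # uniform prior belief over theta

x = np.array([1, 1, 0, 1, 1, 0, 1, 1])        # observed coin flips (made up)
likelihood = theta ** x.sum() * (1 - theta) ** (len(x) - x.sum())

evidence = np.sum(prior * likelihood)         # the normalizing factor P(D)
posterior = prior * likelihood / evidence     # updated belief over theta

print(theta[np.argmax(posterior)])            # most plausible theta after seeing the data
```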
A note on the type of objects in Bayes theorem: $P(\theta)$ and $P(\theta \mid D)$ are distributions over the parameter $\theta$, $P(D \mid \theta)$ is the likelihood (a function of $\theta$, not a distribution over it), and $P(D)$ is a number.

We will look at an example of Bayesian estimation for a binary dataset in $\{0,1\}^n$ modeled using a Bernoulli distribution with a Beta prior.

2.1. Bernoulli with Beta prior

Prior

$$\text{Beta}(\alpha, \beta) = \frac{1}{B(\alpha, \beta)} \, p^{\alpha - 1} (1-p)^{\beta - 1}$$

- $\alpha, \beta > 0$ are parameters of the distribution.
- The support is $[0, 1]$.
- $B(\alpha, \beta)$ is a normalizing constant that ensures the density is a valid PDF.

Different choices of $\alpha$ and $\beta$ give the Beta distribution quite different shapes; each one can model a different prior belief about $p$.

Likelihood
The Bernoulli likelihood:

$$p^{n_h} (1-p)^{n_t}$$

Posterior

$$\text{Posterior} \propto \text{Prior} \times \text{Likelihood}$$

$$\text{Posterior} \propto p^{\alpha - 1} (1-p)^{\beta - 1} \cdot p^{n_h} (1-p)^{n_t}$$

$$\text{Posterior} = \text{Beta}(\alpha + n_h, \beta + n_t)$$

The Beta distribution is a conjugate prior for the Bernoulli likelihood. A conjugate prior has the same functional form as the likelihood, so the posterior stays in the same family as the prior, which simplifies its computation.

2.2. Point estimate

Often we want a point estimate (a single number) for the parameter, but Bayesian methods return a distribution over the parameter. We look at two ways to extract a point estimate: the expectation of the posterior and the mode of the posterior.

For this example, the expected value of the posterior is

$$\frac{\alpha + n_h}{\alpha + \beta + n}$$

The mode of the posterior, for $\alpha + n_h > 1$ and $\beta + n_t > 1$, is

$$\frac{\alpha + n_h - 1}{\alpha + \beta + n - 2}$$

The mode of the posterior is often called the Maximum A Posteriori (MAP) estimate, since the mode is nothing but the (arg)maximum of the posterior.
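A minimal sketch of the Beta-Bernoulli update with made-up prior parameters and counts; the posterior, its mean, and the MAP estimate follow directly from the formulas above.

```python
# Prior Beta(alpha, beta); data: n_h heads and n_t tails (all values made up)
alpha, beta = 2.0, 2.0
n_h, n_t = 7, 3
n = n_h + n_t

# By conjugacy, the posterior is Beta(alpha + n_h, beta + n_t)
alpha_post, beta_post = alpha + n_h, beta + n_t

posterior_mean = (alpha + n_h) / (alpha + beta + n)
map_estimate = (alpha + n_h - 1) / (alpha + beta + n - 2)  # requires alpha + n_h > 1 and beta + n_t > 1

print(alpha_post, beta_post)   # 9.0 5.0
print(posterior_mean)          # 9/14 ~ 0.643
print(map_estimate)            # 8/12 ~ 0.667
```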
3. Gaussian Mixture Models

To model more complex distributions, we can use a Gaussian Mixture Model (GMM). A GMM is a probability distribution: a mixture of K Gaussians, each of which is called a component.

$$f(x; \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x; \mu_k, \sigma_k^2) \quad \text{where} \quad \sum_{k=1}^{K} \pi_k = 1$$

so that f is a valid PDF.

- $f$ is the PDF of the GMM.
- $\mathcal{N}$ is the PDF of a Gaussian distribution with mean $\mu_k$ and variance $\sigma_k^2$.
- $\pi_k$ is the "prior" contribution of the $k$-th component; the $\pi_k$ are called the mixture probabilities.

A GMM is a latent variable model. That is, we can view the data-generation process by introducing a latent (hidden) variable $z_i$ for each data-point. Generating $x_i$ can be explained as follows:

1. Choose a component k by setting $z_i = k$ with prior probability $\pi_k$.
2. Sample a point from this Gaussian; the conditional density associated with this is $\mathcal{N}(x; \mu_k, \sigma_k^2)$.

The joint density of seeing the point $x_i$ from component k is

$$f(X = x_i, Z = k) = \pi_k \, \mathcal{N}(x_i; \mu_k, \sigma_k^2)$$

Marginalizing over the random variable Z gives the density

$$f(X = x_i) = \sum_{k=1}^{K} f(X = x_i, Z = k) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i; \mu_k, \sigma_k^2)$$

This is the density of the GMM as seen before, but explained using latent variables. Note that the latent variable is not explicitly observed; we only posit that such a variable exists. Only the dataset D is observed.

For a GMM with K components, we need to estimate 3K parameters: K mixture probabilities, K means, and K variances.
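The generative view translates directly into code. A small sketch with made-up mixture parameters for K = 3 components: draw the latent $z_i$ first, then draw $x_i$ from that component's Gaussian; the mixture density is the $\pi$-weighted sum of the component densities.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical GMM with K = 3 components (parameters made up)
pi = np.array([0.5, 0.3, 0.2])      # mixture probabilities, sum to 1
mu = np.array([-3.0, 0.0, 4.0])     # component means
sigma = np.array([1.0, 0.5, 1.5])   # component standard deviations

rng = np.random.default_rng(0)

# Generative process: pick a component z_i, then sample x_i from N(mu_z, sigma_z^2)
z = rng.choice(len(pi), size=1000, p=pi)
x = rng.normal(mu[z], sigma[z])

# Mixture density f(x) = sum_k pi_k * N(x; mu_k, sigma_k^2)
def gmm_pdf(points):
    return np.sum(pi * norm.pdf(points[:, None], mu, sigma), axis=1)

print(gmm_pdf(np.array([-3.0, 0.0, 4.0])))
```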
4. EM algorithm

We can use MLE to estimate the parameters, but there is no closed-form solution. Thankfully, we have an iterative approach to parameter estimation called the EM algorithm. This algorithm makes use of some intermediate variables $\lambda_{ik}$ that help in parameter estimation. Points to note:

- $i$ corresponds to the index of the data-point.
- $k$ corresponds to the index of the component.
- $\lambda_{ik}$ can be interpreted as a conditional probability.
- $0 \le \lambda_{ik} \le 1$ and $\sum_{k=1}^{K} \lambda_{ik} = 1$ for all $i$.
- There are $nK$ such variables.

The parameters are collectively referred to as $\theta = [\pi, \mu, \sigma]$. We improve our estimate of $\theta$ at each step. There are two steps in the algorithm:

- E-step: update the values of $\lambda$ using the current values of $\theta$.
- M-step: update the values of $\theta$ using the newly found values of $\lambda$.

Convergence criterion
Stop when the change between successive iterates becomes smaller than some $\epsilon$:

$$\|\theta^{(t+1)} - \theta^{(t)}\| < \epsilon$$

Initialization
Use the K-means algorithm to initialize $\theta_k = [\pi_k, \mu_k, \sigma_k]$.

Until convergence:

E-step

$$\lambda_{ik} = P(z_i = k \mid X = x_i)$$

$\lambda_{ik}$ is the contribution of the $k$-th component to the point $x_i$, given that we have observed the point $x_i$. It represents the posterior probability of Z given X, that is, $P(Z \mid X)$. Using Bayes theorem:

$$\lambda_{ik} = \frac{P(z_i = k) \, P(X = x_i \mid z_i = k)}{P(X = x_i)} = \frac{\pi_k \, \mathcal{N}(x_i; \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i; \mu_j, \sigma_j^2)}$$

Here we use the current values of $\pi_k, \mu_k, \sigma_k^2$ to compute $\lambda_{ik}$.

M-step
Use the values of $\lambda_{ik}$ obtained in the E-step to update $\pi_k, \mu_k, \sigma_k^2$:

$$\mu_k = \frac{\sum_{i=1}^{n} \lambda_{ik} x_i}{\sum_{i=1}^{n} \lambda_{ik}}, \quad \sigma_k^2 = \frac{\sum_{i=1}^{n} \lambda_{ik} (x_i - \mu_k)^2}{\sum_{i=1}^{n} \lambda_{ik}}, \quad \pi_k = \frac{\sum_{i=1}^{n} \lambda_{ik}}{n}$$
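A minimal sketch of the whole loop for a one-dimensional GMM, assuming simulated data and a simple random initialization in place of the K-means initialization described above.

```python
import numpy as np
from scipy.stats import norm

def em_gmm(x, K, n_iter=200, eps=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    # Simple initialization (a stand-in for the K-means initialization above)
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    var = np.full(K, x.var())

    for _ in range(n_iter):
        old = np.concatenate([pi, mu, var])

        # E-step: lambda_ik = pi_k N(x_i; mu_k, var_k) / sum_j pi_j N(x_i; mu_j, var_j)
        dens = pi * norm.pdf(x[:, None], mu, np.sqrt(var))   # shape (n, K)
        lam = dens / dens.sum(axis=1, keepdims=True)

        # M-step: update the parameters using the responsibilities lambda_ik
        Nk = lam.sum(axis=0)
        mu = (lam * x[:, None]).sum(axis=0) / Nk
        var = (lam * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / n

        # Convergence check on successive parameter iterates
        if np.linalg.norm(np.concatenate([pi, mu, var]) - old) < eps:
            break
    return pi, mu, var

# Simulated data from a two-component mixture (parameters made up)
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 0.5, 500)])
print(em_gmm(x, K=2))
```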
Soft clustering

The EM algorithm can be seen as a method that does soft clustering. $\lambda_{ik}$ can be seen as the affinity of $x_i$ to component k. In K-means this affinity is binary -- a point either belongs to a cluster or it does not. In the case of EM, this affinity is a number in $[0, 1]$. The E-step is analogous to the cluster-assignment step in K-means, and the M-step is analogous to the update of the cluster centers in K-means.
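To make the contrast concrete, here is a tiny sketch with a made-up responsibility matrix: each row is a soft assignment that sums to 1, and taking the argmax of each row recovers a hard, K-means-style assignment.

```python
import numpy as np

# Hypothetical responsibilities lambda_ik for 3 points and 2 components
lam = np.array([[0.9, 0.1],
                [0.4, 0.6],
                [0.5, 0.5]])

print(lam.sum(axis=1))     # each row sums to 1: soft assignments
print(lam.argmax(axis=1))  # hard, K-means-style assignments: [0 1 0]
```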