$$L(f, D) = \frac{1}{2}\sum_{i=1}^{n} [f(x_i) - y_i]^2$$

• $y_i$ → true label for $x_i$
• $f(x_i)$ → predicted label
• $f(x_i) - y_i$ → error in prediction

This is called the sum of squared errors or SSE. The 0.5 is added for mathematical convenience. Dividing this by $n$ gives the mean squared error or MSE:

$$L(f, D) = \frac{1}{2n}\sum_{i=1}^{n} [f(x_i) - y_i]^2$$

The task is to find a function $f$ that minimizes the loss function on the training dataset while generalizing well to unseen data-points. Since the class of all functions is huge, we usually restrict ourselves to a more tractable set.

Almost any ML algorithm can be understood in terms of four components:

• Model
• Loss function
• Optimization
• Evaluation

We will study a specific approach called linear regression.

3. Linear Regression

3.1. Model

When $f$ is assumed to be a linear function, the resulting solution approach is called linear regression:

$$f(x) = w^T x = w_1 x_1 + \cdots + w_d x_d$$

where $w \in \mathbb{R}^d$ is called the weight vector. We also call $f$ a linear model. For a given data-point-label pair $(x_i, y_i)$, $f(x_i)$ is called the predicted label (written $\hat{y}_i$) and $y_i$ is the true label. For all data-points, the predicted label vector can be expressed as:

$$\hat{y} = X^T w$$

Expanded:

$$\hat{y} = \begin{bmatrix} x_1^T w \\ \vdots \\ x_n^T w \end{bmatrix}$$

where $\hat{y} \in \mathbb{R}^n$.
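As a concrete illustration, here is a minimal numpy sketch of the linear model and the two losses above. The data-matrix follows the $d \times n$ convention used in these notes (columns are data-points); all names and values are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 3, 100                    # number of features, number of data-points
X = rng.normal(size=(d, n))      # data-matrix: columns are the data-points x_i
w = rng.normal(size=d)           # some weight vector
y = rng.normal(size=n)           # true labels

y_hat = X.T @ w                  # predicted label vector, shape (n,)

sse = 0.5 * np.sum((y_hat - y) ** 2)   # sum of squared errors
mse = sse / n                          # mean squared error (with the 1/2 factor)
print(sse, mse)
```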
3.2. Loss Function

The SSE loss function with a linear model can be expressed in three different but equivalent forms:

Vector-scalar form

$$L(w) = \frac{1}{2}\sum_{i=1}^{n} [w^T x_i - y_i]^2$$

Vector-vector form

Since $\hat{y} = X^T w$, we have:

$$L(w) = \frac{1}{2}\|\hat{y} - y\|^2$$

Matrix-vector form

$$L(w) = \frac{1}{2}\|X^T w - y\|^2$$

A visualization of the squared loss and its contours for $w \in \mathbb{R}^2$:

[Figure: surface and contour plot of $L(w_1, w_2) = (w_1 - 1)^2 + (w_2 + 1)^2 + 1$, which attains its minimum at $w^* = (1, -1)$.]

• The height of the surface above the $w_1$-$w_2$ plane is the value of the loss function.
• The SSE loss function is convex.
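A quick numerical check, assuming numpy and randomly generated data, that the three forms above agree:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 100
X = rng.normal(size=(d, n))      # columns are data-points
w = rng.normal(size=d)
y = rng.normal(size=n)

# Vector-scalar form: 0.5 * sum_i (w^T x_i - y_i)^2
L1 = 0.5 * sum((w @ X[:, i] - y[i]) ** 2 for i in range(n))

# Vector-vector form: 0.5 * ||y_hat - y||^2
y_hat = X.T @ w
L2 = 0.5 * np.sum((y_hat - y) ** 2)

# Matrix-vector form: 0.5 * ||X^T w - y||^2
L3 = 0.5 * np.linalg.norm(X.T @ w - y) ** 2

print(np.isclose(L1, L2), np.isclose(L2, L3))   # True True
```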
3.3. Optimization

The task is to find a $w^*$ that minimizes the loss:

$$w^* = \arg\min_{w} \frac{1}{2}\sum_{i=1}^{n} [w^T x_i - y_i]^2$$

The SSE loss function $L(w)$ is convex. Hence, all its local minima are global minima.

3.3.1. Gradient

The gradient of the SSE is given as follows:

$$\nabla L(w) = \begin{bmatrix} \dfrac{\partial L(w)}{\partial w_1} \\ \vdots \\ \dfrac{\partial L(w)}{\partial w_d} \end{bmatrix}$$

Vector-scalar form

$$\nabla L(w) = \sum_{i=1}^{n} [w^T x_i - y_i]\, x_i$$

Matrix-vector form

$$\nabla L(w) = X X^T w - X y$$

3.3.2. Normal equations

The system of linear equations below is called the normal equations:

$$X X^T w = X y$$

This system is consistent for any choice of $X$ and $y$. One special solution is given by:

$$w^* = (X^T)^{\dagger} y$$
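A minimal sketch of solving the normal equations numerically with numpy; `np.linalg.lstsq` is used only as an independent cross-check, and the data is randomly generated:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 100
X = rng.normal(size=(d, n))                 # columns are data-points
y = rng.normal(size=n)

# Special solution of the normal equations via the pseudo-inverse of X^T
w_star = np.linalg.pinv(X.T) @ y

# Cross-check: numpy's least-squares solver applied to X^T w ~ y
w_lstsq, *_ = np.linalg.lstsq(X.T, y, rcond=None)

print(np.allclose(w_star, w_lstsq))          # True
print(np.allclose(X @ X.T @ w_star, X @ y))  # the normal equations hold at w*
```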
Why do we need the pseudo-inverse? (out of syllabus, can be skipped)

Consider the system:

$$Ax = b$$

If $Ax = b$ is not consistent, you can still hunt for an approximate solution by looking for an $x$ that minimizes $\|Ax - b\|^2$. In other words, we try to get as close to $b$ as possible:

$$Ax \approx b$$

The resulting solution is given by the pseudo-inverse of $A$:

$$x^* = A^{\dagger} b$$

In the linear regression setting, $A = X^T$, $x = w$, $b = y$:

$$X^T w \approx y$$

The solution is given by:

$$w^* = (X^T)^{\dagger} y$$

If $\text{rank}(X) = d$, then the pseudo-inverse of $X^T$ can be written as:

$$(X^T)^{\dagger} = (X X^T)^{-1} X$$

Note that the pseudo-inverse is defined for all matrices. For more details, refer to the Moore-Penrose inverse. If $Ax = b$ is consistent, $A^{\dagger} b$ gives the solution that has the least norm.
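The two properties mentioned in the box, least-squares solution for inconsistent systems and least-norm solution for consistent ones, can be checked numerically. A small illustrative sketch with numpy (matrix sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

# Inconsistent (tall) system: A x = b has no exact solution.
# x* = pinv(A) @ b minimizes ||A x - b||^2.
A = rng.normal(size=(6, 2))
b = rng.normal(size=6)
x_star = np.linalg.pinv(A) @ b
x_other = x_star + 0.1 * rng.normal(size=2)
print(np.linalg.norm(A @ x_star - b) <= np.linalg.norm(A @ x_other - b))   # True

# Consistent (wide) system: infinitely many solutions.
# pinv(A) @ b is the one with the least norm.
A = rng.normal(size=(2, 5))
b = A @ rng.normal(size=5)                 # guarantees consistency
x_min = np.linalg.pinv(A) @ b

# Build another exact solution by adding a null-space direction of A.
_, _, Vt = np.linalg.svd(A)
null_dir = Vt[-1]                          # A @ null_dir is (numerically) zero
x_other = x_min + null_dir
print(np.allclose(A @ x_other, b))                           # still a solution
print(np.linalg.norm(x_min) <= np.linalg.norm(x_other))      # True
```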
$(X^T)^{\dagger}$ is the pseudo-inverse of $X^T$ and is a $d \times n$ matrix. $w^*$ lies in the span of the data-points.

If $\text{rank}(X) < d$, there are infinitely many solutions. Out of all of them, $w^*$ is the solution that has the least norm. Any solution to the system can be written as $w^* + w_{\perp}$, where $w_{\perp}$ is perpendicular to the span of the data-points (so $X^T w_{\perp} = 0$), since:

$$X X^T (w^* + w_{\perp}) = X y$$

Specifically, if $\text{rank}(X) = d$, the solution is unique and is given by:

$$w^* = (X X^T)^{-1} X y$$

$w^*$ is sometimes termed the least squares solution, since it is obtained by minimizing the SSE.

3.4. Gradient descent

The optimum weight vector can also be found using gradient descent. If $w^{(t)}$ represents the weight vector at time $t$, the update is given by:

$$w^{(t+1)} = w^{(t)} - \eta \, \nabla L(w)\big|_{w^{(t)}}$$

Substituting the gradient:

$$w^{(t+1)} = w^{(t)} - \eta \,(X X^T w^{(t)} - X y)$$

$\eta > 0$ is the learning rate. Since the objective is convex, gradient descent is guaranteed to converge for suitable choices of the learning rate $\eta$. $\eta$ is called a hyperparameter (something that you choose).

3.5. Stochastic Gradient Descent

For large datasets, computing the entire gradient in one go might be costly. In such cases, a random subset (random sample) of the dataset is extracted, and the approximate gradient from this subset is used in the update equation. If $X' \in \mathbb{R}^{d \times m}$ represents a sample of $m$ points and $y'$ the corresponding labels:

$$w^{(t+1)} = w^{(t)} - \eta \,[X' (X')^T w^{(t)} - X' y']$$

The progress of SGD is less smooth, given that we are only using an approximate version of the gradient. Eventually both methods reach the minimum.
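A minimal sketch of both update rules with numpy; the learning rate, step counts and batch size are illustrative choices, not tuned values:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 3, 200
X = rng.normal(size=(d, n))                    # columns are data-points
w_true = rng.normal(size=d)
y = X.T @ w_true + 0.1 * rng.normal(size=n)    # noisy linear labels

eta = 1e-3                                     # learning rate (hyperparameter)

# Gradient descent: full gradient X X^T w - X y at every step
w_gd = np.zeros(d)
for _ in range(2000):
    w_gd -= eta * (X @ X.T @ w_gd - X @ y)

# Stochastic gradient descent: gradient from a random sample of m points
w_sgd = np.zeros(d)
m = 20
for _ in range(2000):
    idx = rng.choice(n, size=m, replace=False)
    Xs, ys = X[:, idx], y[idx]
    w_sgd -= eta * (Xs @ Xs.T @ w_sgd - Xs @ ys)

# Both should be close to the least squares solution
w_star = np.linalg.pinv(X.T) @ y
print(np.round(w_star, 3), np.round(w_gd, 3), np.round(w_sgd, 3))
```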
[Figure: comparison of Gradient Descent and Stochastic Gradient Descent.]

3.6. Evaluation

The model's performance is evaluated on a test dataset, which is a collection of data-points not seen during training. A good performance metric is the root mean squared error (RMSE):

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n} (w^T x_i - y_i)^2}$$

This has the same units as the label, which makes the error more interpretable.
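A short sketch of the evaluation step, assuming numpy; the train/test split sizes and noise level are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_train, n_test = 3, 200, 50
w_true = rng.normal(size=d)

X_train = rng.normal(size=(d, n_train))
y_train = X_train.T @ w_true + 0.1 * rng.normal(size=n_train)
X_test = rng.normal(size=(d, n_test))
y_test = X_test.T @ w_true + 0.1 * rng.normal(size=n_test)

w_star = np.linalg.pinv(X_train.T) @ y_train     # fit on training data only

rmse = np.sqrt(np.mean((X_test.T @ w_star - y_test) ** 2))
print(rmse)                                      # roughly the noise level, ~0.1
```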
4. Geometric Perspective

4.1. Best-fit surface

The geometry of the relationship between the predicted label and the features in linear regression takes the following form in different dimensions:

[Figure: in one dimension, the line $y = wx$ in the $x$-$y$ plane; in two dimensions, the plane $y = w_1 x_1 + w_2 x_2$ over the $x_1$-$x_2$ plane.]

Each blue point here represents $(x_i, y_i)$, where $y_i$ is the true label. Note that the point $(x_i, y_i)$ need not lie on the hyperplane. However, the point $(x_i, \hat{y}_i)$ will lie on the hyperplane.

4.2. Projections

Feature vectors

If all the values of a given feature across data-points are collected as components of a vector, we get a feature vector. For example, $r_j$ will have the values of the $j$-th feature for all data-points:

$$r_j \in \mathbb{R}^n$$

We can now look at $X$ in two ways:

Columns made up of data-points

$$X = \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix}, \qquad X^T = \begin{bmatrix} - & x_1^T & - \\ & \vdots & \\ - & x_n^T & - \end{bmatrix}$$

Rows made up of feature vectors

$$X = \begin{bmatrix} - & r_1^T & - \\ & \vdots & \\ - & r_d^T & - \end{bmatrix}, \qquad X^T = \begin{bmatrix} | & & | \\ r_1 & \cdots & r_d \\ | & & | \end{bmatrix}$$
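A tiny illustration of these two views, assuming numpy; the numbers are arbitrary:

```python
import numpy as np

# Three data-points in R^2, stored as columns of X (the d x n convention used here)
x1, x2, x3 = np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])
X = np.column_stack([x1, x2, x3])        # shape (d, n) = (2, 3)

print(X[:, 0])       # first column = data-point x_1
print(X[0, :])       # first row    = feature vector r_1 (feature 1 across all points)
print(X.T[0, :])     # first row of X^T = x_1^T
```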
The row space of $X$, which is the same as the column space of $X^T$, is the span of the feature vectors. The SSE objective can be understood as follows. Find a point in the row space of $X$ that is closest to $y$:
$$\min_{w} \|X^T w - y\|^2$$
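As derived next, the minimizer projects $y$ onto the row space of $X$, so the residual $y - X^T w^*$ is orthogonal to every row of $X$. A quick numerical check with numpy (randomly generated data):

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 3, 10
X = rng.normal(size=(d, n))
y = rng.normal(size=n)

w_star = np.linalg.pinv(X.T) @ y       # minimizer of ||X^T w - y||^2
residual = y - X.T @ w_star

print(np.allclose(X @ residual, 0))    # residual orthogonal to every row of X -> True
```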
[Figure: $y$, its projection $X^T w^*$, and the row space of $X$ inside $\mathbb{R}^n$.]

The required point is the projection of $y$ onto the row space of $X$. This means that $y - X^T w^*$ is orthogonal to the row space of $X$:

$$X(y - X^T w^*) = 0$$

This leads us back to the normal equations, $X X^T w^* = X y$. The projection of $y$ onto the row space of $X$ is $X^T w^*$, with $w^* = (X^T)^{\dagger} y$.

5. Probabilistic Perspective

We assume that $y_i$ is generated by adding a noise term to $w^T x_i$:

$$y_i = w^T x_i + \epsilon_i$$

Assuming $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, we have:

$$y_i \mid x_i \sim \mathcal{N}(w^T x_i, \sigma^2)$$

We can now use MLE to estimate $w$.

Likelihood

$$L(w) \propto \prod_{i=1}^{n} \exp\left[-\frac{(y_i - w^T x_i)^2}{2\sigma^2}\right]$$

Log-likelihood

After collecting terms that do not affect the optimization into $c$:

$$l(w) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - w^T x_i)^2 + c$$

Maximizing the log-likelihood is the same as minimizing the negative log-likelihood. Therefore we end up with the same optimization problem that we saw before:
$$\min_{w} \sum_{i=1}^{n} (y_i - w^T x_i)^2$$

Minimizing the SSE is equivalent to maximizing the likelihood under a zero-mean Gaussian noise assumption.
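A small simulation of this equivalence, assuming numpy: data is generated with Gaussian noise, and the least-squares solution is verified to also maximize the Gaussian log-likelihood (up to the constant $c$):

```python
import numpy as np

rng = np.random.default_rng(7)
d, n, sigma = 3, 500, 0.5
X = rng.normal(size=(d, n))
w_true = rng.normal(size=d)
y = X.T @ w_true + sigma * rng.normal(size=n)      # y_i = w^T x_i + Gaussian noise

def log_likelihood(w):
    # Gaussian log-likelihood, up to the additive constant c
    return -np.sum((y - X.T @ w) ** 2) / (2 * sigma ** 2)

w_star = np.linalg.pinv(X.T) @ y                   # least-squares solution

# The SSE minimizer also maximizes the Gaussian log-likelihood
for _ in range(5):
    w_other = w_star + 0.1 * rng.normal(size=d)
    assert log_likelihood(w_star) >= log_likelihood(w_other)
print("w_true:", np.round(w_true, 2), " w*:", np.round(w_star, 2))
```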
6. Kernel Regression

6.1. Learning

Consider a non-linear transformation of the features:

$$\phi: \mathbb{R}^d \to \mathbb{R}^D$$

The transformed dataset is:

$$\{(\phi(x_1), y_1), \cdots, (\phi(x_n), y_n)\}$$

The transformed data-matrix is:

$$\phi(X) \in \mathbb{R}^{D \times n}$$

We can now perform regression in the transformed feature space. The model is:

$$f: \mathbb{R}^D \to \mathbb{R}, \qquad f(x) = w^T \phi(x)$$

The loss function becomes:

$$L(w) = \|\phi(X)^T w - y\|^2$$

Since $w^*$ lies in the span of the (transformed) data-points, for some coefficient vector $\alpha \in \mathbb{R}^n$:

$$w^* = \phi(X)\alpha = \begin{bmatrix} | & & | \\ \phi(x_1) & \cdots & \phi(x_n) \\ | & & | \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{bmatrix}$$
The loss now becomes:

$$L(\alpha) = \|\phi(X)^T \phi(X)\alpha - y\|^2$$

If the kernel associated with $\phi$ is:

$$k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$$

the kernel matrix $K \in \mathbb{R}^{n \times n}$ is:

$$K = \phi(X)^T \phi(X)$$

which is computed element-wise as:

$$K_{ij} = k(x_i, x_j)$$

With this, the loss function's final form is:

$$L(\alpha) = \|K\alpha - y\|^2$$

Minimizing this loss results in:

$$\alpha^* = K^{\dagger} y$$

If $K$ is invertible, we have:

$$\alpha^* = K^{-1} y$$

With this, $w^* = \phi(X)\alpha^*$, which cannot be computed explicitly since $\phi$ is not explicitly computed.

6.2. Prediction

For a test data-point $x_{\text{test}}$, the predicted label is:

$$\hat{y} = \phi(x_{\text{test}})^T w^*$$

Using $w^* = \phi(X)\alpha^*$, we have:

$$\hat{y} = \phi(x_{\text{test}})^T \phi(X)\alpha^*$$

The product before $\alpha^*$ is a $1 \times n$ vector:

$$\phi(x_{\text{test}})^T \begin{bmatrix} | & & | \\ \phi(x_1) & \cdots & \phi(x_n) \\ | & & | \end{bmatrix} = \begin{bmatrix} \phi(x_{\text{test}})^T \phi(x_1) & \cdots & \phi(x_{\text{test}})^T \phi(x_n) \end{bmatrix}$$

Expanding the RHS of the predicted label:

$$\hat{y} = \sum_{i=1}^{n} \alpha^*_i \, k(x_{\text{test}}, x_i)$$

• $\alpha^*_i$: importance of the $i$-th data-point in predicting the label for $x_{\text{test}}$
• $k(x_{\text{test}}, x_i)$: similarity between $x_{\text{test}}$ and $x_i$
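A compact end-to-end sketch of kernel regression with numpy; the RBF kernel, its width, and the toy sine dataset are illustrative assumptions, not part of the notes above:

```python
import numpy as np

def rbf_kernel(a, b, gamma=5.0):
    # k(a, b) = exp(-gamma * ||a - b||^2); one illustrative kernel choice
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Non-linear 1-D data: y = sin(2x), with data-points as columns of X
n = 21
X = np.linspace(-2, 2, n)[None, :]            # shape (d, n) = (1, 21)
y = np.sin(2 * X[0])

# Learning: build the kernel matrix K and set alpha* = K^+ y
K = np.array([[rbf_kernel(X[:, i], X[:, j]) for j in range(n)] for i in range(n)])
alpha = np.linalg.pinv(K) @ y

# Prediction: y_hat = sum_i alpha_i * k(x_test, x_i)
x_test = np.array([0.5])
y_hat = sum(alpha[i] * rbf_kernel(x_test, X[:, i]) for i in range(n))
print(y_hat, np.sin(2 * 0.5))                 # prediction vs. true value, should be close
```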