Remark: It is important to note the final goal in any supervised learning problem is good performance on the test dataset. Note that the test dataset is not available during training. For this reason, it is often called a set of unseen data-points. A "good" supervised ML model is said to generalize well to unseen data-points. Here, the term generalization denotes the ability of the model to capture the real patterns in the data, while ignoring the noise, which in turn gives it the power to perform well on points it has not seen before.In our course, we will spend most of the time with the training data. The training data is the raw material for learning the model. It is important to acquire a good working knowledge of the various techniques to train a model. That said, the learner is encouraged to keep the importance of the test dataset in the back of his mind at all points of time.
2. RegressionIf f:RdāR is a potential mapping (model) from data-points to labels, we use the following metric to measure the goodness of the model.L(f,D)=1
2nāi=1(f(xi)-yi)2⢠yi ā true label for xi⢠f(xi)ā predicted label⢠f(xi)-yiā error in predictionThis is called the sum of squared errors or SSE. The 1/2 is added for mathematical convenience. This is particularly useful when we differentiate the loss function with respect to the parameters of the model. Dividing the SSE by n gives the mean squared error or MSE:L(f,D)=1
2nnāi=1(f(xi)-yi)2The task is to find a function f that minimizes the loss function on the training dataset while generalizing well to unseen data-points. Since the class of all possible functions is so huge, we usually choose a more tractable set: the class of linear functions. Hence, the approach we will study now is called linear regression.3. Linear Regression3.1. ModelWhen f is assumed to be a linear function, the resulting solution approach is called linear regression. Consider f:RdāR such that:f(x)=wTx=w1x1+āÆ+wdxdwhere wāRd is called the weight vector. We call f a linear model. For a given data-point-label pair (xi,yi), f(xi) is called the predicted label and yi is the true label. For all data-points, the predicted label vector can be expressed as:y=XTwTo see why, expand the RHS:y=a
xT1w
ā®
xTnw
where, yāRn.3.2. Loss FunctionThe SSE loss function with a linear model can be expressed in three different but equivalent forms:Vector-scalar formL(w)=1
2nāi=1(wTxi-yi)2Vector-vector formIf y=XTw, we have:L(w)=1
2||y-y||2Matrix-vector formL(w)=1
2||XTw-y||2A visualization of the squared loss and its contours for wāR2 is given below. This is called the parameter space. In the parameter space, the loss function is a surface. As we move along the w1-w2 plane, we see that the loss function's value changes. Note that the loss function depends on the dataset as well as the model parameters. Here is an example loss surface.L(w1,w2)w1w2L(w)=(w1-1)2+(w2+1)2+1w=a
1
-1
The height of the surface above the w1-w2 plane is the value of the loss function.
Convexity of SSE: The SSE loss function is convex. To see why, consider the function again:L(w)=1
2nāi=1(wTxi-yi)2Each term is of the form (wTxi-yi)2. First, note that wTxi-yi is convex since affine functions are convex. Next, (wTxi-yi)2 is convex. This is because a composition of a convex function and an affine function is convex. Finally, the sum of all these convex functions is again convex.
The convexity of SSE makes life very simple. If the function has a local minimum, then it has to be global.3.3. OptimizationThe task is to find a w that minimizes the loss:w=
min
w1
2nāi=1(wTxi-yi)2As stated before, the SSE loss function, L(w), is convex. Hence, all its local minima are global minima. In addition, since L(w) is differentiable, if it has a minimum, then āL(w)=0 has to hold at that point. So the next step is to compute the gradient.3.3.1. GradientThe gradient is the vector of partial derivatives:āL(w)=a
āL(w)
āw1
ā®
āL(w)
āwd
Let us now compute this for one term in the loss function:a
ā(wTxi-yi)2
=2ā (wTxi-yi)ā xi
Summing this over the n data-points:a
āL(w)
=1
2ā nāi=12ā (wTxi-yi)ā xi
=nāi=1(wTxi-yi)xi
Let us now express this in two different but equivalent forms:Vector-scalar formāL(w)=nāi=1(wTxi-yi)xiMatrix-vector formPlaying around with the previous equation, we get:a
āL(w)
=nāi=1(wTxi-yi)xi
=nāi=1xi(xTiw-yi)
=[nāi=1xixTi]w-nāi=1xiyi
=XXTw-Xy
This leads us to another equivalent equation for the gradient:āL(w)=XXTw-Xy3.3.2. Normal equationsSetting the gradient to zero, we get:XXTw=XyThis system linear equations equations are called the normal equations. It is consistent for any choice of X and y. One special solution is given by:w=(XT)ā ywhere (XT)ā is called the pseudo-inverse of XT.
What is the pseudo-inverse and why do we need it?The pseudo-inverse of a matrix A of shape mĆn is a unique matrix Aā of shape nĆm that behaves like an inverse. The pseudo-inverse is defined for all matrices, unlike the inverse which is defined for only some square matrices. The pseudo-inverse is intimately tied to a system of linear equations. Consider the system:Ax=bIf Ax=b is not consistent, one can still hunt for an approximate solution by looking for x that minimizes ||Ax-b||2. Since Ax is a vector in the column space of A, we are searching for a vector in the column space of A that is as close to b as possible:AxābIt turns out that one of the possible solutions for x is given using the pseudo-inverse of A:x=Aā bIn the linear regression setting, A=XT,x=w,b=y. We are looking for a parameter vector w so that the predicted label vector, XTw, is as close as possible to the true label vector, y, that is:XTwāyThe solution is given by:w=(XT)ā yIf rank(X)=d, then the pseudo-inverse of XT can be written as:(XT)ā =(XXT)-1XFor more details, refer to Moore Penrose inverse. If Ax=b is consistent, Aā b gives the solution that has the least norm.
(XT)ā is the pseudo-inverse of XT and is a dĆn matrix. It turns out that w lies in the span of the data-points. This is a key observation and will be used when we discuss kernel regression.If rank(X)<d, there are infinitely many solutions. Out of all of them, w is the solution that has the least norm. Any solution to the system can be written as w+wā, where wā is perpendicular to the span of the data-points since:XXT(w+wā)=XySpecifically, if rank(X)=d the solution is unique and is given by:w=(XXT)-1Xyw is often termed as the least squares solution, since it is obtained by minimizing the SSE.3.4. Gradient descentXXT is a dĆd matrix. If d is a large value, then computing the inverse of XXT is going to be a costly operation. Rather than finding a closed form solution that involves the inverse, we can settle for an iterative approach using gradient descent. If w(t) represents the weight vector at time t, the update is given by:w(t+1)=w(t)-šāL(w)|w(t)Substituting the gradient:w(t+1)=w(t)-š(XXTw(t)-Xy)š>0 is the learning rate. Since the objective is convex, gradient descent is guaranteed to converge to the global minimum for suitable choices of the learning rate š. š is called a hyperparameter (something that you choose). w(0)=0 is one of the choices for the initial weight vector. However, it is quite common to choose w(0) randomly, say by sampling a vector of length d from a standard normal Gaussian distribution or a uniform distribution. In our course, unless otherwise stated, we will stick to a zero initialization.Visually, we can see the progress of the algorithm on the parameter space. In the image below, we have the contours of the loss function for the case of d=2. The individual dots represent the iterates w(t) for various values of t:
w1w2Dark green indicates early stage of the optimization process, yellow indicates the intermediate stage and red indicates the final stage of the process as the iterates get close to the minimum. In this example, we see the progress from w(0)=(0,0) to w(T)=(-2,5). An important observation here is the direction of the trajectory and the angle it makes with the contours along the way. Notice how the trajectory seems to be perpendicular to the contours at every point. This is not a coincidence. Recall that the gradient of a function at a point is orthogonal to the contours at that point.
Remark: The gradient for a single data-point has a nice interpretation. Consider the gradient expression again:(wTxi-yi)xiThis can be rewritten as(yi-yi)xiIf we treat yi-yi as the error in prediction, then the gradient of the loss with respect to the weight vector for a single data-point assumes the form:errorĆdata-pointWhat this means is that, a data-point's contribution to the weight update depends on two things:⢠The magnitude depends on how well the model is performing currently on that data-point. The poorer the performance, higher the error, bigger the contribution. ⢠The direction of the contribution is parallel (anti-parallel) to the data-point.
3.5. Stochastic Gradient DescentFor large datasets, computing the entire gradient in one go might be costly. To see why the size of the dataset matters, recall that the gradient is given by:nāi=1(wTxi-yi)xiAs n increases, the time taken to compute the gradient on the entire dataset increases. In such cases, a random subset (random sample) of the dataset is extracted and the gradient on this subset is used to approximate the full gradient. If Xā²āRdĆm represents a sample of m points, then the update equation for the approximate setting becomes:w(t+1)=w(t)-š[(Xā²)(Xā²)Tw(t)-Xā²y]This algorithm is called stochastic gradient descent (SGD). The qualifier stochastic arises from the randomness in picking the subset of points for the update. The progress of SGD is less smooth compared to GD given that we are only using an approximate version the gradient. Eventually, SGD also reaches the local minimum under suitable choices of the learning rate.
Gradient DescentStochastic Gradient DescentNotice how the trajectory in the case of SGD is no longer orthogonal to the contours. This is because, the gradient in the case of SGD is the approximate gradient and not the full gradient of the loss function.In practice, how does SGD work? First we divide the n data-points into a set of batches, each having a small number of data-points. For example, if we have 320 data-points, we might divide into 10 batches, each having 32 data-points. The batch-size is denoted as the number of data-points in a batch, which is 32 in this example. Next, we compute the gradient for each batch and use that in the update rule for SGD. One complete pass over the training dataset is called an epoch. That is, in an epoch, every data-point is used exactly once in some update. In our example, in one epoch, we will update the parameters 10 times, once for each batch. Before every epoch begins, it is also important to shuffle the data-points. This makes the learning more robust. The need for shuffling the dataset will be taken up when we discuss SGD in the context of classification.In general, if we have a training dataset of n data-points and a batch-size B, then one epoch will witness n
B updates to the parameters, where n
B is the number of batches. Visually:
n data-pointsāÆBBBBābatch-size3.6. EvaluationThe model's performance is evaluated on a test dataset which is a collection of data-points not seen during training. As stated before, a good score on the test dataset is the goal of any supervised ML problem. A good performance metric in the case of regression is the root mean squared error (RMSE):1
nnāi=1(wTxi-yi)2This has the same units as the label which makes the error more interpretable. The factor of 1/2 can be omitted from here since the RMSE isn't used in an optimization context.4. Geometric PerspectiveWe can view the regression problem from the standpoint of two spaces: Rd+1 and Rn.4.1. Best-fit surface, Rd+1 viewThe geometry of the relationship between predicted label and features in linear regression takes the following form in different dimensions:
yxy=wxyx1x2y=w1x1+w2x2Each blue point here represents (xi,yi), where yi is the true label. Note that the point (xi,yi) need not lie on the hyperplane. However, the point (xi,yi) will lie on the hyperplane. In general, for d features, the feature-label space is Rd+1. The best-fit hyperplane is of the form wTx. This is a hyperplane of dimension d embedded in the space Rd+1.4.2. Projections, Rn viewFeature vectorsIf all the feature values for a given feature are added as components of a vector, we get a feature vector. For example, rj will have the values for the jth feature for all data-points:rjāRnWe can now look at X in two ways:Columns made up of data-pointsX=a
|
|
x1
āÆ
xn
|
|
,XT=a
-
xT1
-
-
xn
-
Rows made up of feature vectorsX=a
-
rT1
-
ā®
-
rTd
-
,XT=a
|
|
r1
āÆ
rd
|
|
The row space of X, which is the same as the column space of XT, is the span of the feature vectors. The SSE objective can be understood as follows. Find a point in the row space of X that is closest to y:
min
w||XTw-y||2
yXTwRowspace of XRnThe required point is the projection of y on the row space of X. Let XTw be this projection. It follows that y-XTw is orthogonal to the row space of X:X(y-XTw)=0This leads us back to the normal equations, XXTw=Xy. The vector w is given by (XT)ā y.5. Probabilistic PerspectiveThe final perspective is perhaps the most important. We assume that the true relationship between the features and labels is linear, corrupted by some noise. We model this noise to be a zero-mean Gaussian. The noise is independent of the features and we denote it by ši for the ith data-point:yi=wTxi+šiSetting šiā¼N(0,š2), where š2 is the variance in the noise, we have:yiā¼N(wTxi,š2)We can now use MLE to estimate w. In this, note that xi is treated as a constant. yi is the random variable.LikelihoodL(w)ānāi=1exp[-(yi-wTxi)2
2š2]Log-likelihoodAfter collecting terms that do not affect the optimization in c:l(w)=-1
2š2nāi=1(yi-wTxi)2+cMaximizing the log-likelihood is the same as minimizing the negative log-likelihood. Therefore we end up with the same optimization problem that we saw before:
min
wnāi=1(yi-wTxi)2Minimizing the SSE is equivalent to maximizing the likelihood under a zero mean, Gaussian noise assumption. This gives a probabilistic motivation to the original optimization problem involving the SSE. It gives some sort of justification for our choice of SSE as a suitable objective function for linear regression. While the SSE presents itself as an intuitive choice, the probabilistic approach justifies its use with a slightly more principled approach.Since we have turned linear regression into an estimation problem (using MLE), we will call w the maximum likelihood estimator or the ML estimator of w:wML=(XXT)-1XyOnce we have an estimator, we can ask how good it is in a statistical sense. This will be taken up in the beginning of the next document.6. Kernel RegressionSo far we have assumed f to be linear. A linear model is the right choice if the underlying data is linear. However, this is rarely the case in the real world. Most of the datasets are non-linear. A sample non-linear dataset with one feature is shown below:
xyWe can deal with non-linearity by introducing feature transformations and try to fit a linear model in the transformed feature space. The hope is that the transformed dataset is linear in this higher dimensional space. All that we have learnt about linear regression can therefore be directly brought to bear upon the non-linear case.For more on non-linear transformations, refer to week-2.6.1. LearningConsider the non-linear transformation of the features:š:RdāRDThe transformed dataset is:{(š(x1),y1),āÆ,(š(xn),yn)}The transformed data-matrix is:š(X)āRDĆnWe can now perform regression in the transformed feature space. The model is:f:RDāRf(x)=wTš(x)The loss function becomes:L(w)=||š(X)Tw-y||2Recall that we can chose the optimal weight vector w such that it lies in the span of the data-points. Therefore, we posit the existence of some š¼āRn such that:a
w
=š(X)š¼
=a
|
|
š(x1)
āÆ
š(xn)
|
|
a
š¼1
ā®
š¼n
Plugging w into the objective, the loss now becomes a function of š¼:L(š¼)=||š(X)Tš(X)š¼-y||2The matrix š(X)Tš(X) should be familiar. It is the matrix of pair-wise dot-products between points in the transformed space, the Gram matrix. This immediately suggests that we can use the kernel trick. If the kernel associated with š is:k:RdĆRdāRthe kernel matrix KāRnĆn is:K=š(X)Tš(X)which is computed element-wise asKij=k(xi,xj)Plugging the kernel matrix into the loss, the final form is:L(š¼)=||Kš¼-y||2Minimizing the L would result in:š¼=Kā yIf K is invertible, we haveš¼=K-1yIf this is not clear, refer to the write-up on the pseudoinverse earlier in this document. Comparing this with Axāb, we have to substitute A=K,b=y,x=š¼.With this, w=š(X)š¼. However, w cannot be computed since š is not explicitly specified. We infer it from the kernel. Thankfully, what we need is only the prediction for a test data-point and to compute that we have enough information.6.2. PredictionFor a test data-point xtest, the predicted label is:y=š(xtest)TwUsing w=š(X)š¼, we have:y=š(xtest)Tš(X)š¼The product before š¼ is a 1Ćn vector:š(xtest)Ta
|
|
š(x1)
āÆ
š(xn)
|
|
=a
š(xtest)Tš(x1)
āÆ
š(xtest)Tš(xn)
Each component of this vector is nothing but a kernel function evaluation. Expanding the RHS of the predicted label:y=nāi=1š¼iā k(xtest,xi)This predicted label has a nice interpretation. It depends on two things:⢠š¼i: importance of ith data-point in predicting the label for xtest⢠k(xtest,xi): similarity between xtest and xi