Week-5
1. Supervised Learning
   1.1. Regression
   1.2. Classification
2. Regression
3. Linear Regression
   3.1. Model
   3.2. Loss Function
   3.3. Optimization
      3.3.1. Gradient
      3.3.2. Normal equations
   3.4. Gradient descent
   3.5. Stochastic Gradient Descent
   3.6. Evaluation
4. Geometric Perspective
   4.1. Best-fit surface
   4.2. Projections
5. Probabilistic Perspective
6. Kernel Regression
   6.1. Learning
   6.2. Prediction

1. Supervised Learning

1.1. Regression

Given a dataset D = {(x_1, y_1), ⋯, (x_n, y_n)}, where y_i ∈ R, learn a function f: R^d → R that captures the underlying relationship (pattern) between data-points and labels and generalizes to unseen data-points. The defining characteristic of a regression problem is that the labels are continuous-valued real numbers.

1.2. Classification

Given a dataset D = {(x_1, y_1), ⋯, (x_n, y_n)}, where y_i ∈ {1, ⋯, k}, learn a function f: R^d → {1, ⋯, k} that captures the underlying relationship (pattern) between data-points and labels and generalizes to unseen data-points. The defining characteristic of a classification problem is that the labels come from a finite, discrete set. If k = 2, the problem is called binary classification. If k > 2, the problem is called multi-class classification.

2. Regression

If f: R^d → R is a potential mapping (model) from data-points to labels, we use the following metric to measure the goodness of the model:

L(f, D) = \frac{1}{2} \sum_{i=1}^{n} [f(x_i) - y_i]^2
• y_i → true label for x_i
• f(x_i) → predicted label
• f(x_i) - y_i → error in prediction

This is called the sum of squared errors or SSE. The 0.5 is added for mathematical convenience. Dividing this by n gives the mean squared error or MSE:

L(f, D) = \frac{1}{2n} \sum_{i=1}^{n} [f(x_i) - y_i]^2
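As a quick illustration, here is a minimal NumPy sketch of these two quantities (the function names and the toy arrays are illustrative, not fixed by the notes):

```python
import numpy as np

def sse(y_pred, y_true):
    """Sum of squared errors with the 1/2 factor: (1/2) * sum_i (f(x_i) - y_i)^2."""
    errors = y_pred - y_true
    return 0.5 * np.sum(errors ** 2)

def mse(y_pred, y_true):
    """Mean squared error with the 1/2 factor: (1/(2n)) * sum_i (f(x_i) - y_i)^2."""
    return sse(y_pred, y_true) / len(y_true)

print(mse(np.array([2.0, 0.5]), np.array([1.5, 1.0])))   # 0.125
```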
The task is to find a function f that minimizes the loss function on the training dataset while generalizing well to unseen data-points. Since the class of all such functions is huge, we usually restrict ourselves to a more tractable set. Almost any ML algorithm can be understood in terms of four components:

• Model
• Loss function
• Optimization
• Evaluation

We will study a specific approach called linear regression.

3. Linear Regression

3.1. Model

When f is assumed to be a linear function, the resulting solution approach is called linear regression:

f(x) = w^T x = w_1 x_1 + \cdots + w_d x_d

where w ∈ R^d is called the weight vector. We also call f a linear model. For a given data-point-label pair (x_i, y_i), f(x_i) is called the predicted label and y_i is the true label. For all data-points, the predicted label vector can be expressed as \hat{y} = X^T w. Expanded:

\hat{y} = X^T w = \begin{bmatrix} x_1^T w \\ \vdots \\ x_n^T w \end{bmatrix}

where \hat{y} ∈ R^n and X ∈ R^{d \times n} is the data matrix with data-points as columns.
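A minimal NumPy sketch of this prediction step under the d Ɨ n data-matrix convention (the shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5
X = rng.normal(size=(d, n))   # data matrix: one data-point per column
w = rng.normal(size=d)        # weight vector

y_hat = X.T @ w               # predicted label vector, one entry per data-point
print(y_hat.shape)            # (5,)
```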
3.2. Loss Function

The SSE loss function with a linear model can be expressed in three different but equivalent forms.

Vector-scalar form

L(w) = \frac{1}{2} \sum_{i=1}^{n} [w^T x_i - y_i]^2
Vector-vector form

With \hat{y} = X^T w, we have:

L(w) = \frac{1}{2} ||\hat{y} - y||^2
Matrix-vector form

L(w) = \frac{1}{2} ||X^T w - y||^2
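A small check on synthetic data that the three forms evaluate to the same number (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5
X = rng.normal(size=(d, n))   # data-points as columns
w = rng.normal(size=d)
y = rng.normal(size=n)        # synthetic true labels

y_hat = X.T @ w

loss_scalar = 0.5 * sum((w @ X[:, i] - y[i]) ** 2 for i in range(n))  # vector-scalar form
loss_vector = 0.5 * np.sum((y_hat - y) ** 2)                          # vector-vector form
loss_matrix = 0.5 * np.linalg.norm(X.T @ w - y) ** 2                  # matrix-vector form

assert np.isclose(loss_scalar, loss_vector) and np.isclose(loss_vector, loss_matrix)
```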
A visualization of the squared loss and its contours for w ∈ R^2:
• The height of the surface above the w_1-w_2 plane is the value of the loss function.
• The SSE loss function is convex.

3.3. Optimization

The task is to find a w that minimizes the loss:

\min_{w} \; \frac{1}{2} \sum_{i=1}^{n} [w^T x_i - y_i]^2
The SSE loss function L(w) is convex. Hence, all its local minima are global minima.

3.3.1. Gradient

The gradient of the SSE is given as follows:

\nabla L(w) = \begin{bmatrix} \dfrac{\partial L(w)}{\partial w_1} \\ \vdots \\ \dfrac{\partial L(w)}{\partial w_d} \end{bmatrix}
Vector-scalar form

\nabla L(w) = \sum_{i=1}^{n} [w^T x_i - y_i] \, x_i

Matrix-vector form

\nabla L(w) = X X^T w - X y

3.3.2. Normal equations

Setting the gradient to zero gives the system of linear equations below, called the normal equations:

X X^T w = X y

This system is consistent for any choice of X and y. One special solution is given by:

w = (X^T)^{\dagger} y
Why do we need the pseudo-inverse? (out of syllabus, can be skipped)

Consider the system Ax = b. If Ax = b is not consistent, you can still hunt for an approximate solution by looking for the x that minimizes ||Ax - b||^2. In other words, we try to get as close to b as possible:

Ax \approx b

The resulting solution is given using the pseudo-inverse of A:

x = A^{\dagger} b

In the linear regression setting, A = X^T, x = w, b = y:

X^T w \approx y

The solution is given by:

w = (X^T)^{\dagger} y

If rank(X) = d, then the pseudo-inverse of X^T can be written as:

(X^T)^{\dagger} = (X X^T)^{-1} X

Note that the pseudo-inverse is defined for all matrices. For more details, refer to the Moore-Penrose inverse. If Ax = b is consistent, A^{\dagger} b gives the solution that has the least norm.
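A minimal NumPy sketch of this least-squares solution on toy data; np.linalg.pinv computes the pseudo-inverse, and np.linalg.lstsq solves the same minimization directly (the data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 20
X = rng.normal(size=(d, n))                        # data-points as columns
y = rng.normal(size=n)                             # labels

w_pinv = np.linalg.pinv(X.T) @ y                   # w = (X^T)^dagger y
w_lstsq, *_ = np.linalg.lstsq(X.T, y, rcond=None)  # argmin_w ||X^T w - y||^2
w_normal = np.linalg.solve(X @ X.T, X @ y)         # (X X^T)^{-1} X y, valid since rank(X) = d here

assert np.allclose(w_pinv, w_lstsq) and np.allclose(w_pinv, w_normal)
```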
(X^T)^{\dagger} is the pseudo-inverse of X^T and is a d Ɨ n matrix. w lies in the span of the data-points. If rank(X) < d, there are infinitely many solutions; out of all of them, w is the solution with the least norm. Any solution to the system can be written as w + w_{\perp}, where w_{\perp} is perpendicular to the span of the data-points, since X^T w_{\perp} = 0 implies:

X X^T (w + w_{\perp}) = X y

Specifically, if rank(X) = d, the solution is unique and is given by:

w = (X X^T)^{-1} X y

w is sometimes termed the least squares solution, since it is obtained by minimizing the SSE.

3.4. Gradient descent

The optimal weight vector can also be found using gradient descent. If w^{(t)} represents the weight vector at time t, the update is given by:

w^{(t+1)} = w^{(t)} - \eta \, \nabla L(w) \big|_{w^{(t)}}

Substituting the gradient:

w^{(t+1)} = w^{(t)} - \eta \, (X X^T w^{(t)} - X y)

\eta > 0 is the learning rate. Since the objective is convex, gradient descent is guaranteed to converge for suitable choices of the learning rate \eta. \eta is called a hyperparameter (something that you choose).

3.5. Stochastic Gradient Descent

For large datasets, computing the entire gradient in one go can be costly. In such cases, a random subset (random sample) of the dataset is drawn, and the approximate gradient computed from this subset is used in the update equation. If X' ∈ R^{d Ɨ m} represents a sample of m points and y' ∈ R^m their labels:

w^{(t+1)} = w^{(t)} - \eta \, [X' (X')^T w^{(t)} - X' y']

The progress of SGD is less smooth since we are only using an approximate version of the gradient. Eventually both reach the minimum.

3.6. Evaluation

The model's performance is evaluated on a test dataset, which is a collection of data-points not seen during training. A good performance metric is the root mean squared error (RMSE):

\sqrt{\frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2}

This has the same units as the label, which makes the error more interpretable.
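A compact sketch of the batch gradient-descent update, a mini-batch (SGD-style) variant, and RMSE evaluation on synthetic data; the learning rate, batch size, iteration counts, and noise level are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200
X = rng.normal(size=(d, n))                   # training data-points as columns
w_true = rng.normal(size=d)                   # weights used only to simulate labels
y = X.T @ w_true + 0.1 * rng.normal(size=n)   # training labels
eta = 0.001                                   # learning rate (hyperparameter, chosen by hand)

# 3.4 Batch gradient descent: w <- w - eta * (X X^T w - X y)
w = np.zeros(d)
for _ in range(500):
    w = w - eta * (X @ X.T @ w - X @ y)

# 3.5 Mini-batch SGD: the same update computed on a random sample of m points and their labels
w_sgd, m = np.zeros(d), 32
for _ in range(2000):
    idx = rng.choice(n, size=m, replace=False)
    Xs, ys = X[:, idx], y[idx]
    w_sgd = w_sgd - eta * (Xs @ Xs.T @ w_sgd - Xs @ ys)

# 3.6 Evaluation: RMSE on a held-out test set drawn from the same model
X_test = rng.normal(size=(d, 50))
y_test = X_test.T @ w_true + 0.1 * rng.normal(size=50)
rmse = np.sqrt(np.mean((X_test.T @ w - y_test) ** 2))
print(w_true, w, w_sgd, rmse)
```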
4. Geometric Perspective

4.1. Best-fit surface

The geometry of the relationship between the predicted label and the features in linear regression takes the following form in different dimensions:
d      Space      Object                                    Interpretation
1      R^2        Line passing through the origin           Best-fit line
2      R^3        Plane passing through the origin          Best-fit plane
≄ 3    R^{d+1}    Hyperplane passing through the origin     Best-fit hyperplane
Here is a visualization of the first two cases. Each blue point represents (x_i, y_i), where y_i is the true label. Note that the point (x_i, y_i) need not lie on the hyperplane. However, the point (x_i, \hat{y}_i) will lie on the hyperplane.

4.2. Projections

Feature vectors

If all the values of a given feature across data-points are collected as components of a vector, we get a feature vector. For example, r_j holds the values of the j-th feature for all data-points: r_j ∈ R^n. We can now look at X in two ways.

Columns made up of data-points

X = \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix}, \quad
X^T = \begin{bmatrix} - \; x_1^T \; - \\ \vdots \\ - \; x_n^T \; - \end{bmatrix}
Rows made up of feature vectors

X = \begin{bmatrix} - \; r_1^T \; - \\ \vdots \\ - \; r_d^T \; - \end{bmatrix}, \quad
X^T = \begin{bmatrix} | & & | \\ r_1 & \cdots & r_d \\ | & & | \end{bmatrix}
The row space of X, which is the same as the column space of X^T, is the span of the feature vectors. The SSE objective can be understood as follows: find the point in the row space of X that is closest to y:

\min_{w} \; ||X^T w - y||^2

The required point is the projection of y onto the row space of X. This means that y - X^T w is orthogonal to the row space of X:

X (y - X^T w) = 0

This leads us back to the normal equations, X X^T w = X y. The projection of y onto the row space of X is X^T w, where w = (X^T)^{\dagger} y.
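A quick numerical check of this picture: with the least-squares w, the residual y - X^T w is orthogonal to the row space of X, i.e. X(y - X^T w) is (numerically) zero (synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 10
X = rng.normal(size=(d, n))
y = rng.normal(size=n)

w = np.linalg.pinv(X.T) @ y           # least-squares solution
residual = y - X.T @ w                # y minus its projection onto the row space of X
print(np.allclose(X @ residual, 0))   # True: the residual is orthogonal to every feature vector
```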
5. Probabilistic Perspective

We assume that y_i is generated by adding a noise term to w^T x_i:

y_i = w^T x_i + \epsilon_i

Assuming \epsilon_i \sim N(0, \sigma^2), we have:

y_i \mid x_i \sim N(w^T x_i, \sigma^2)

We can now use MLE to estimate w.

Likelihood

L(w) \propto \prod_{i=1}^{n} \exp\left[ -\frac{(y_i - w^T x_i)^2}{2\sigma^2} \right]
Log-likelihood

After collecting the terms that do not affect the optimization into a constant c:

l(w) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - w^T x_i)^2 + c
Maximizing the log-likelihood is the same as minimizing the negative log-likelihood. Therefore we end up with the same optimization problem that we saw before:

\min_{w} \; \sum_{i=1}^{n} (y_i - w^T x_i)^2

Minimizing the SSE is equivalent to maximizing the likelihood under a zero-mean Gaussian noise assumption.
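A small simulation consistent with this view: labels generated as y_i = w^T x_i + ε_i with Gaussian noise, where the least-squares (equivalently, maximum-likelihood) estimate approximately recovers the weights (all values are synthetic and illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 3, 5000, 0.5
w_true = rng.normal(size=d)

X = rng.normal(size=(d, n))
y = X.T @ w_true + sigma * rng.normal(size=n)    # y_i = w^T x_i + eps_i, eps_i ~ N(0, sigma^2)

w_mle, *_ = np.linalg.lstsq(X.T, y, rcond=None)  # minimizing the SSE = maximizing the likelihood
print(np.round(w_true, 3), np.round(w_mle, 3))   # the two vectors should be close
```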
6. Kernel Regression

6.1. Learning

Consider a non-linear transformation of the features:

\phi: R^d \to R^D

The transformed dataset is:

\{(\phi(x_1), y_1), \cdots, (\phi(x_n), y_n)\}

The transformed data-matrix is:

\phi(X) \in R^{D \times n}

We can now perform regression in the transformed feature space. The model is:

f: R^D \to R, \quad f(x) = w^T \phi(x)

The loss function becomes:

L(w) = ||\phi(X)^T w - y||^2

Since w lies in the span of the (transformed) data-points, for a coefficient vector \alpha \in R^n:

w = \phi(X) \alpha = \begin{bmatrix} | & & | \\ \phi(x_1) & \cdots & \phi(x_n) \\ | & & | \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{bmatrix}
The loss now becomes:

L(\alpha) = ||\phi(X)^T \phi(X) \alpha - y||^2

If the kernel associated with \phi is:

k: R^d \times R^d \to R

the kernel matrix K ∈ R^{n Ɨ n} is:

K = \phi(X)^T \phi(X)

which is computed element-wise as K_{ij} = k(x_i, x_j). With this, the loss function's final form is:

L(\alpha) = ||K \alpha - y||^2

Minimizing L results in:

\alpha = K^{\dagger} y

If K is invertible, we have \alpha = K^{-1} y. With this, w = \phi(X) \alpha, which cannot be computed explicitly since \phi is never evaluated explicitly.

6.2. Prediction

For a test data-point x_{test}, the predicted label is:

\hat{y} = \phi(x_{test})^T w

Using w = \phi(X) \alpha, we have:

\hat{y} = \phi(x_{test})^T \phi(X) \alpha

The product before \alpha is a 1 Ɨ n vector:

\phi(x_{test})^T \begin{bmatrix} | & & | \\ \phi(x_1) & \cdots & \phi(x_n) \\ | & & | \end{bmatrix}
= \begin{bmatrix} \phi(x_{test})^T \phi(x_1) & \cdots & \phi(x_{test})^T \phi(x_n) \end{bmatrix}

Expanding the RHS of the predicted label:

\hat{y} = \sum_{i=1}^{n} \alpha_i \cdot k(x_{test}, x_i)

• \alpha_i: importance of the i-th data-point in predicting the label for x_{test}
• k(x_{test}, x_i): similarity between x_{test} and x_i
Expanding the RHS of the predicted label: y=nāˆ‘i=1š›¼iā‹…k(xtest,xi) • š›¼i: importance of ith data-point in predicting the label for xtest• k(xtest,xi): similarity between xtest and xi