Week-6 | Summary

Week-6

1. MSE 2. Overfitting and Underfitting 2.1. Toy dataset 2.2. Overfitting 2.3. Data Augmentation 2.4. Underfitting 3. Model Complexity 4. Regularization 4.1. Ridge Regression 4.2. LASSO Regression 5. Cross Validation 5.1. k-Fold cross validation 5.2. Leave one out cross validation 6. Probabilistic view

1. MSEThe label vector y is a random vector with mean X

w and covariance matrix 𝜎

I: y=N

(

w,𝜎

)

The ML estimator in linear regression is also a random vector: w=

(

)

†

y For this discussion, let us assume that

rank

(

)

=d, so that: w=

(

)

-1

Xy The expected value of the estimator is E

[

]

=w, which means that this is an unbiased estimator. The MSE of this estimator is given by: E

[

||w-w||

]

=𝜎

⋅

trace

[

(

)

-1

]

2. Overfitting and Underfitting2.1. Toy datasetConsider a regression problem in which the true relationship is given by: g

(

)

sin

2𝜋x, 0⩽x⩽1 The label is a noisy, corrupted version of this function. The noise is an error term 𝜖 that is sampled from a zero mean Gaussian with variance of 0.1: 𝜖∼N

(

0,0.1

)

The label is therefore: y

| x

∼N

(

)

,0.1

)

This is the data-generation part. In reality, we don't know the functional form of g

(

)

. This is a toy dataset meant to aid in visualizing things. Let us look at a dataset of size n=10: D=

{

(

)

,⋯,

(

)

}

We can turn this into a linear regression problem by forming non-linear polynomial features. Let us assume that we settle for a third degree polynomial. The feature vector is: x=

X is going to be a d×n matrix with d=4 and n=10. d=4 since we are also adding an intercept. The label vector is y. Concretely, X=

1	⋯	1
x1	⋯	x10
x21	⋯	x210
x31	⋯	x310

, y=

⋮

The weight vector is going to be: w=

Therefore, our model is:

f(x)	=w0+w1x+w2x2+w3x3
	=wTx

Remark: As far as the original problem is concerned, f is a non-linear function of the feature. After adding more (polynomial) features, f is linear in the weights. This is still called a linear regression problem as far the weights are concerned.

We can now learn the optimal weights by minimizing the SSE. We get: w=

(

)

-1

Xy The weights we end up with are: w=

-0.04

9.54

-28.12

18.61

If we plot the function learnt, this is how it looks:

It seems like a reasonable fit. 2.2. OverfittingWhat if we change p=9. The resulting data-matrix will be of shape d×n with d=10,n=10.

f(x)	=w0+w1x+⋯+w9x9
	=wTx

0.02
62.12
-1383.34
13019.05
-63619.93
178687.61
-300179.44
297927.91
-160994.34
36480.26

The model fits the noise in addition to the signal. This problem is termed overfitting. Also note the really high magnitudes for the components of the weight vector. An overfit model is said to be a high variance model: small changes in the dataset can result in wildly different models. For example: Note the boxed region. A small change in the position of one or two data-points caused a big change in the fitted curve. 2.3. Data AugmentationIncreasing the data or augmenting it (data-augmentation) is one way of mitigating the problem of overfitting. For n=100, the result is as follows:

Data Augmentation is a popular technique for increasing the size of the dataset: Coming back to the regression problem, the weight vector for this model is: w=

-0.06
10.91
-85.89
609.29
-2557.32
5953.79
-8253.17
6949.81
-3318.79
691.52

Note the relatively smaller values for the weights. Models that are really complex tend to overfit. A high-degree polynomial is one such example. 2.4. UnderfittingUnderfitting is the opposite problem of being unable to capture the signal in the data:

A line is too simple for the given data and completely misses the patterns. When a model underfits, it is usually a good idea to fit a more complex model. 3. Model ComplexityIf the train loss is low and the test loss is high, the model is overfitting. If the train loss is itself high, the model is underfitting. Here is a general plot of train and test loss with respect to model complexity: Note how the training loss keeps decreasing as the models get more and more complex. 4. RegularizationRegularization is a method to mitigate overfitting. We can't always resort to adding more training data. In such situations, we impose an a constraint on the model's weights. We study two forms of regularization: 4.1. Ridge Regression The loss function is modified as follows. This is called the regularized loss function:

(1)

(

)

⋅n∑i=1

(

-y

)

⏠

⏣

⏡

⏣

⏢

SSE

𝜆

||w||

⏠

⏣

⏡

⏣

⏢

regularization term

where: ||w||

+⋯+w

There are two aspects to the regularized loss:• data-dependent loss: SSE• model-dependent loss: regularization term– controls the complexity of the model ||w||

is the L

norm of w. This is the usual Euclidean norm. 𝜆 is called the regularization rate and is a hyperparameter.

Hyperparameters versus Parameters • parameters are learnt from data• hyperparameters are chosen prior to the learning– however, the best hyperparameter for a given problem is still determined by making use of the (validation) data

For 𝜆=0.001, this is the resulting model:

The corresponding weight vector: w=

0.14
5.38
-10.52
-4.02
2.33
4.89
4.44
2.22
-0.81
-4.10

Ridge regression has the effect of pushing the weights nearer to zero though not exactly to zero. Let us see how to solve for w: L

(

)

⋅n∑i=1

(

-y

)

⏠

⏣

⏡

⏣

⏢

SSE

𝜆

||w||

⏠

⏣

⏡

⏣

⏢

regularization term

On setting the gradient to zero we have XX

w-Xy+𝜆w=0 Simplifying:

(

+𝜆I

)

w=Xy Therefore: w=

(

+𝜆I

)

-1

Xy Geometric view The optimization problem can be equivalently formulated as:

min

⋅n∑i=1

(

-y

)

⏠

⏣

⏡

⏣

⏢

SSE

subject to ||w||

⩽𝜃 where 𝜃 is some scalar dependent on 𝜆. Plotting the contours of the SSE and the constraint region: Contours of the SSE

||X

w-y||

=||X

-y||

From the normal equations, we have XX

=Xy. Using this, we get:

(

w-w

)

(

)

(

w-w

)

=c Quadratic expression in w. 4.2. LASSO Regression L

(

)

⋅n∑i=1

(

-y

)

⏠

⏣

⏡

⏣

⏢

SSE

+𝜆||w||

⏠

⏣

⏡

⏣

⏢

regularization term

where: ||w||

=|w

|+⋯+|w|

is the L

norm. It is also the Manhattan distance of the point w from the origin in R

. No closed form solution can be found for LASSO regression. With a high probability, the contours of the SSE will hit one of the corners of the square. This results in a sparse solution. LASSO: Least Absolute Shrinkage Selection Operator• Least: we are solving a minimization problem• Absolute: the L

norm uses the sum of the absolute values• Shrinkage: the weights are going to become smaller• Selection: since some weights go to zero, it acts as a feature selector 5. Cross ValidationAs mentioned earlier, 𝜆 is called hyperparameter. It is chosen using a process called cross validation. 5.1. k-Fold cross validation

Algorithm • For 𝜆 in

[

-4

,10

-3

,⋯,10

]

• – For k rounds:* Train on k-1 folds and test on the remaining fold.· The size of the training dataset is

(

k-1

)

· The size of the validation dataset is

· Store the validation loss* Average the validation losses from the k models• • Return 𝜆 that gives least average validation loss

5.2. Leave one out cross validation

Algorithm • For 𝜆 in

[

-4

,10

-3

,⋯,10

]

• – For n rounds:* Train on n-1 data-points and test on the remaining data-point.· The size of the training dataset is n-1· The size of the validation dataset is 1· Store the validation loss* Average the validation losses from the n models• • Return 𝜆 that gives least average validation loss

This process of determining the best value of 𝜆 for a given problem is called hyperparameter tuning. 6. Probabilistic viewWe have: y∼N

(

w,𝜎

)

Prior w∼N

(

0,𝛾

)

(

)

(

𝛾

2𝜋

)

exp

[

-||w||

2𝛾

]

Likelihood L

(

w;D

)

(

𝜎

2𝜋

)

exp

[

-||y-X

w||

2𝜎

]

Posterior

∝

Prior

Likelihood

(

w | D

)

∝

exp

[

-||y-X

w||

2𝜎

-||w||

2𝛾

]

MAP To get a point estimate for w, we maximize the posterior. This is called the maximum a posteriori estimate. Taking

log

for convenience:

max

-||y-X

w||

2𝜎

-||w||

2𝛾

Maximizing this is the same as minimizing the negative of the objective:

min

||y-X

w||

2𝜎

||w||

2𝛾

Multiplying throughout by 𝜎

and setting 𝜆=

𝜎

𝛾

, we have:

min

||y-X

w||

𝜆

||w||

This is the same as the ridge regression problem. Thus the MAP estimate with a zero mean Gaussian prior is equivalent to the solution to the ridge regression. The inverse relationship between 𝜆 and 𝛾 can be understood as follows:• Smaller the value of 𝛾, tinier the region in which we would like w to lie. • Alternatively, higher the value of 𝜆, the smaller we want the norm of the weights to be.