$$\xi_i = 1 - (w^\top x_i)\,y_i$$
We can think of $\xi_i$ as the minimum (scaled) distance that the point has to move in the correct direction so that it no longer violates the margin; the corresponding geometric distance is $\xi_i / \|w\|$. Here are some sample points along with their $\xi_i$ values:

[Figure: a positive point shown relative to the hyperplanes $w^\top x = 1$, $w^\top x = 0$ and $w^\top x = -1$; points with $w^\top x = 3, 1, 0, -1$ have $\xi = 0, 0, 1, 2$ respectively.]

Points that violate the margin and are farther away from the correct supporting hyperplane suffer a greater penalty. Points that are beyond the margin and on the right side of it do not suffer any penalty. This is expressed in a neat formula:
$$\xi_i = \max\big(0,\ 1 - (w^\top x_i)\,y_i\big)$$

1.2. Optimization Problem

We can now formulate the optimization problem as follows:
$$\min_{w,\,\xi}\ \ \frac{\|w\|^2}{2} + C \cdot \sum_{i=1}^{n} \xi_i$$
subject to:
$$(w^\top x_i)\,y_i + \xi_i \geqslant 1, \qquad \xi_i \geqslant 0, \qquad 1 \leqslant i \leqslant n$$
The objective has a new term $\sum_{i=1}^{n} \xi_i$, which is the total penalty paid across all data-points. $C$ is a hyperparameter that controls the trade-off between the width of the margin and the violations.
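Before simplifying further, note that this constrained problem can be handed directly to a generic convex solver. Below is a minimal sketch assuming the cvxpy package is available; the function name soft_margin_svm and the call pattern are made up for illustration and are not part of these notes:

```python
import numpy as np
import cvxpy as cp

def soft_margin_svm(X, y, C):
    """Solve the constrained soft-margin problem with a generic convex solver."""
    n, d = X.shape
    w = cp.Variable(d)            # weight vector (no bias term, matching the formulation above)
    xi = cp.Variable(n)           # slack variables, one per data-point
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w) + xi >= 1,   # (w^T x_i) y_i + xi_i >= 1
                   xi >= 0]                           # xi_i >= 0
    cp.Problem(objective, constraints).solve()
    return w.value, xi.value
```

Calling soft_margin_svm(X, y, C=1.0) with labels $y_i \in \{-1, +1\}$ would return the learned $w$ together with the slack values $\xi_i$.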
1.3. Hinge Loss Formulation

Since we know the exact expression for $\xi_i$, we can get rid of the constraints and move the penalty directly into the objective:
$$\min_{w}\ \ \frac{\|w\|^2}{2} + C \cdot \sum_{i=1}^{n} \max\big(0,\ 1 - (w^\top x_i)\,y_i\big)$$
The expression $\sum_{i=1}^{n} \max\big(0,\ 1 - (w^\top x_i)\,y_i\big)$ is termed the hinge loss. We can interpret the two terms in the objective as follows:
$$\underbrace{\frac{\|w\|^2}{2}}_{\text{margin}} + C \cdot \underbrace{\sum_{i=1}^{n} \max\big(0,\ 1 - (w^\top x_i)\,y_i\big)}_{\text{penalty}}$$
The first term controls the width of the margin; the second controls the total penalty levied on the data-points. These two terms work in opposite directions:
• If $\|w\|$ is small, the margin is wide. If the margin is wide, the penalty paid by the points will generally be high.
• If $\|w\|$ is large, the margin is narrow. If the margin is narrow, the penalty paid by the points will generally be low.

[Figure: two classifiers $w_1$ (wide margin) and $w_2$ (narrow margin), each shown with its decision boundary $w^\top x = 0$ and supporting hyperplanes $w^\top x = \pm 1$.]

• A very small value of $C$ implies that we don't mind the penalties, which encourages wide margins. In the limit as $C \to 0$, the supporting hyperplanes run away to infinity.
• For a very large $C$, we are very particular about keeping the penalties small, which encourages a narrower margin. As $C \to \infty$, we approach the hard-margin SVM.
Visually:
• $C = 0$: wide margin, large penalty; in the limit, the margin becomes infinite and the penalty is irrelevant.
• $C = \infty$: narrow margin, small penalty; in the limit, a perfect margin with zero penalty (if the data is linearly separable).

Alternatively, the soft-margin objective can be interpreted as the sum of a data-dependent loss (the hinge loss) and an $L_2$ regularization term, with $C$ playing the role of the regularization rate:
$$\underbrace{\frac{\|w\|^2}{2}}_{\text{Regularization}} + C \cdot \underbrace{\sum_{i=1}^{n} \max\big(0,\ 1 - (w^\top x_i)\,y_i\big)}_{\text{Hinge Loss}}$$
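To make the hinge-loss objective concrete, here is a minimal numpy sketch; the weight vector, the sample points (mirroring the positive points with $w^\top x = 3, 1, 0, -1$ discussed earlier), and the value of $C$ are made up for illustration:

```python
import numpy as np

def soft_margin_objective(w, X, y, C):
    """Hinge-loss form of the objective: ||w||^2 / 2 + C * sum_i max(0, 1 - (w^T x_i) y_i)."""
    margins = y * (X @ w)                     # (w^T x_i) y_i for every point
    slacks = np.maximum(0.0, 1.0 - margins)   # xi_i = max(0, 1 - (w^T x_i) y_i)
    return 0.5 * np.dot(w, w) + C * slacks.sum()

# Positive points with w^T x equal to 3, 1, 0 and -1
w = np.array([1.0, 0.0])
X = np.array([[3.0, 0.0], [1.0, 0.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, 1, 1])

print(np.maximum(0.0, 1.0 - y * (X @ w)))      # slack values: [0. 0. 1. 2.]
print(soft_margin_objective(w, X, y, C=1.0))   # 0.5 * 1 + 1 * (0 + 0 + 1 + 2) = 3.5
```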
1.4. Solving the Optimization Problem

We will revisit the optimization problem in its original form:
$$\min_{w,\,\xi}\ \ \frac{\|w\|^2}{2} + C \cdot \sum_{i=1}^{n} \xi_i$$
subject to:
$$(w^\top x_i)\,y_i + \xi_i \geqslant 1, \qquad \xi_i \geqslant 0, \qquad 1 \leqslant i \leqslant n$$
Introducing the Lagrangian function and Lagrange multipliers, we get an equivalent formulation.

1.5. Primal and Dual

Primal:
$$\min_{w,\,\xi}\ \max_{\alpha \geqslant 0,\ \beta \geqslant 0}\ \ \frac{\|w\|^2}{2} + C \cdot \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \alpha_i \big(1 - (w^\top x_i)\,y_i - \xi_i\big) - \sum_{i=1}^{n} \beta_i\,\xi_i$$

Dual:
$$\max_{\alpha \geqslant 0,\ \beta \geqslant 0}\ \min_{w,\,\xi}\ \ \frac{\|w\|^2}{2} + C \cdot \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \alpha_i \big(1 - (w^\top x_i)\,y_i - \xi_i\big) - \sum_{i=1}^{n} \beta_i\,\xi_i$$
Solving the unconstrained optimization problem within the dual (setting the gradients of the Lagrangian with respect to $w$ and $\xi$ to zero), we get:
$$w = \sum_{i=1}^{n} \alpha_i\,x_i\,y_i \qquad \text{and} \qquad \alpha_i + \beta_i = C, \quad 1 \leqslant i \leqslant n$$
As before, we can rewrite $w = XY\alpha$. Since both $\alpha_i$ and $\beta_i$ are non-negative, we see that $\alpha_i, \beta_i \in [0, C]$ for all data-points. The dual problem therefore becomes:
$$\max_{0 \,\leqslant\, \alpha \,\leqslant\, C \cdot \mathbf{1}}\ \ \alpha^\top \mathbf{1} - \frac{1}{2}\,\alpha^\top \big(Y^\top X^\top X\,Y\big)\,\alpha$$
This is nearly identical to the hard-margin form. The only change between the two is the constraint set: each $\alpha_i$ is now also bounded above by $C$. As before, we won't solve the dual by hand, but assume that there are solvers that will give us the optimal values $\alpha^*$.
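As a sanity check on the derivation, here is a small sketch, assuming scikit-learn is available and using made-up toy data, that verifies $0 \leqslant \alpha_i^* \leqslant C$ and $w^* = \sum_i \alpha_i^* y_i x_i$ from a fitted linear SVC (scikit-learn's dual_coef_ stores $\alpha_i^* y_i$ for the support vectors):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, roughly separable 2-D toy data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2, 2], size=(20, 2)),
               rng.normal(loc=[-2, -2], size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors,
# so its absolute value is alpha_i, which must lie in [0, C].
alpha = np.abs(clf.dual_coef_).ravel()
print(alpha.min() >= 0, alpha.max() <= C + 1e-8)   # True True

# w* = sum_i alpha_i y_i x_i, summed over the support vectors only
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()
print(np.allclose(w, clf.coef_.ravel()))           # True: matches the solver's weight vector
```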
1.6. Complementary Slackness and Support Vectors

Definition: A support vector is a point for which $\alpha_i^* > 0$.
The complementary slackness conditions are:
$$\alpha_i^* \big[\,1 - (w^{*\top} x_i)\,y_i - \xi_i^*\,\big] = 0, \qquad 1 \leqslant i \leqslant n$$
$$\beta_i^*\,\xi_i^* = 0, \qquad 1 \leqslant i \leqslant n$$
Let us also add the primal and dual constraints.

Primal constraints:
$$\xi_i^* \geqslant 0, \qquad (w^{*\top} x_i)\,y_i + \xi_i^* \geqslant 1$$
We also know:
$$\xi_i^* = \max\big(0,\ 1 - (w^{*\top} x_i)\,y_i\big)$$
Dual constraints:
$$\alpha_i^* \geqslant 0, \qquad \beta_i^* \geqslant 0, \qquad \alpha_i^* + \beta_i^* = C$$
In the following tables, here is the naming convention:
• Correct supporting hyperplane: $w^\top x = 1$ for positive data-points and $w^\top x = -1$ for negative data-points.
• Right side of the correct supporting hyperplane: $w^\top x > 1$ for positive data-points and $w^\top x < -1$ for negative data-points.
• Wrong side of the correct supporting hyperplane: $w^\top x < 1$ for positive data-points and $w^\top x > -1$ for negative data-points.

1.6.1. Primal view
Situation: $(w^{*\top} x_i)\,y_i > 1$
Inference: $\xi_i^* = 0$ and $\alpha_i^* = 0$
Interpretation: Points that lie on the right side of the correct supporting hyperplane are not support vectors.

Situation: $(w^{*\top} x_i)\,y_i = 1$
Inference: $\xi_i^* = 0$ and $\alpha_i^* \in [0, C]$
Interpretation: Points that lie on the correct supporting hyperplane could be support vectors.

Situation: $(w^{*\top} x_i)\,y_i < 1$
Inference: $\xi_i^* > 0$ and $\alpha_i^* = C$
Interpretation: Points that lie on the wrong side of the correct supporting hyperplane are some of the most important support vectors.
1.6.2. Dual view
Situation: $\alpha_i^* = 0$
Inference: $\xi_i^* = 0$ and $(w^{*\top} x_i)\,y_i \geqslant 1$
Interpretation: Points that are not support vectors lie on, or on the right side of, the correct supporting hyperplane.

Situation: $\alpha_i^* \in (0, C)$
Inference: $\xi_i^* = 0$ and $(w^{*\top} x_i)\,y_i = 1$
Interpretation: Support vectors with $\alpha_i^* \neq C$ lie on the correct supporting hyperplane.

Situation: $\alpha_i^* = C$
Inference: $\xi_i^* \geqslant 0$ and $(w^{*\top} x_i)\,y_i \leqslant 1$
Interpretation: The most important support vectors either lie on the correct supporting hyperplane or on the wrong side of it.
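The dual view can be checked numerically. The sketch below, again assuming scikit-learn is available and using made-up toy data, splits the training points into the three cases above using the fitted multipliers (note that scikit-learn's SVC also fits an intercept $b$, which the bias-free formulation in these notes omits):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[2, 2], size=(30, 2)),
               rng.normal(loc=[-2, -2], size=(30, 2))])
y = np.array([1] * 30 + [-1] * 30)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_).ravel()   # alpha_i for the support vectors
sv = clf.support_                        # indices of points with alpha_i > 0
tol = 1e-6

on_margin = sv[alpha < C - tol]                      # 0 < alpha_i < C
at_bound = sv[alpha >= C - tol]                      # alpha_i = C
not_sv = np.setdiff1d(np.arange(len(X)), sv)         # alpha_i = 0

margin = y * clf.decision_function(X)                # (w*^T x_i + b) y_i
print(margin[on_margin])     # each approximately 1: on the correct supporting hyperplane
print(margin[at_bound])      # each at most about 1: possibly margin violators
print(margin[not_sv].min())  # at least about 1: on the right side of the margin
```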
1.7. SVM Scenarios

We have the following possibilities, matching the type of dataset to the type of SVM:
• Linearly separable dataset: Hard-margin, Linear SVM
• Linearly separable dataset with some noise (a few violations allowed): Soft-margin, Linear SVM
• Dataset separable only by a non-linear boundary: Hard-margin, Kernel SVM
• Non-linearly separable dataset with some noise: Soft-margin, Kernel SVM
The kernel is also a hyperparameter.
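Since both $C$ and the kernel are hyperparameters, they are typically chosen by cross-validation. A brief sketch, assuming scikit-learn is available and using made-up toy data in place of a real training set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Made-up toy data standing in for a real training set
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_)   # the (kernel, C) pair with the best cross-validated score
```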
2. Overfitting and Underfitting

• Bias: This measures the accuracy of the predictions, i.e., how close the model's predictions are to the true labels. The smaller this error, the lower the bias. As the complexity of the model increases, bias tends to decrease. High-bias models tend to underfit.
• Variance: This measures the variability of the model under fluctuations in the training dataset. As the complexity of the model increases, variance tends to increase. High-variance models tend to overfit.

Bias and variance work in opposite directions: models with low bias tend to have high variance, and models with high bias tend to have low variance.

[Figure: the four combinations of low/high bias and low/high variance.]

Examples of low-bias, high-variance models:
• Deep decision trees
• Kernel regression with a polynomial kernel of high degree
Examples of high-bias, low-variance models (contrasted with the above in the sketch below):
• Decision stumps and very shallow decision trees
• Vanilla linear regression on a non-linear dataset
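To see the two regimes concretely, here is a small sketch, assuming scikit-learn is available and using made-up noisy data, that contrasts a decision stump (high bias) with a fully grown decision tree (high variance):

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Made-up noisy 1-D regression data
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_train, y_train, X_test, y_test = X[:100], y[:100], X[100:], y[100:]

stump = DecisionTreeRegressor(max_depth=1).fit(X_train, y_train)  # high bias
deep = DecisionTreeRegressor().fit(X_train, y_train)              # high variance

for name, model in [("stump", stump), ("deep tree", deep)]:
    print(name,
          "train MSE:", mean_squared_error(y_train, model.predict(X_train)),
          "test MSE:", mean_squared_error(y_test, model.predict(X_test)))
# Typically the stump underfits (similar, high error on both splits), while the
# deep tree fits the training split almost perfectly but does noticeably worse
# on the test split (overfitting).
```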
3. Ensemble Techniques

Ensemble techniques attempt to aggregate or combine multiple models to arrive at a decision.

3.1. Bagging

Bagging, or bootstrap aggregation, is a technique that tries to reduce the variance. The basic idea is this: averaging a set of observations reduces the variance. Instead of averaging observations, we will be averaging over models trained on different datasets. There are two steps:

3.1.1. Bootstrapping

We start with the dataset $D$ and create multiple versions, or bags, by sampling from $D$ with replacement. Each bag has the same number of data-points as $D$. Let us call the number of bags $B$. For example, with $B = 3$:

[Figure: a dataset $D$ with points $1, \dots, 5$ and three bags (Bag-1, Bag-2, Bag-3) sampled from it with replacement; some points repeat within a bag.]

Note that points can repeat in a given bag.

3.1.2. Aggregation

We train a model on each bag. Typically, this is a high-variance model, such as a deep decision tree. We call the model trained on the $i$-th bag $h_i$. If there are $B$ bags, we can aggregate the outputs of all these models.
• For a regression problem, the method of aggregation could be the mean:
$$\frac{1}{B} \sum_{i=1}^{B} h_i(x)$$
• For a classification problem, the method of aggregation could be the majority vote:
$$\mathrm{sign}\left(\sum_{i=1}^{B} h_i(x)\right)$$
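Here is a minimal sketch of both steps, assuming scikit-learn is available; the helper names bagging_fit and bagging_predict are made up for illustration, and labels are assumed to be in $\{-1, +1\}$ so that the majority vote is a sign:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, B, rng):
    """Bootstrapping: train one deep tree per bag of n points drawn with replacement."""
    n = len(X)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # one bag: indices may repeat
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Aggregation: majority vote, written as the sign of the summed +/-1 predictions."""
    return np.sign(np.sum([m.predict(X) for m in models], axis=0))

# Usage sketch (labels assumed to be -1/+1; an odd B avoids tied votes):
# models = bagging_fit(X_train, y_train, B=25, rng=np.random.default_rng(0))
# y_hat = bagging_predict(models, X_test)
```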
3.1.3. Distribution of points in a bag

Fix a bag and consider an arbitrary data-point, say $(x_i, y_i)$. The probability that this point gets picked as the first point in the bag is $\frac{1}{n}$. The probability that it does not get picked as the first point is $1 - \frac{1}{n}$. The probability that it does not get picked at all in this bag is $\left(1 - \frac{1}{n}\right)^n$, since we are sampling with replacement (the draws are independent). Therefore, the probability that this point appears at least once in the bag is:
$$1 - \left(1 - \frac{1}{n}\right)^n$$
For a very large dataset, this is about $1 - \frac{1}{e} \approx 63\%$. Therefore, about 63% of the data-points in a bag are unique.

3.1.4. Random Forest

Random forests bag deep decision trees with a slight twist. While growing each tree, at each split (question-answer pair), instead of considering all $d$ features, the algorithm randomly samples a subset of $m$ features to choose from; $m$ is typically around $\sqrt{d}$. In this way, the resulting trees are decorrelated. Since the trees are grown independently on their own bags, the algorithm can be run in parallel.

3.2. Boosting

A weak learner is a model that does only slightly better than a random model. For example, a model with an error rate of 0.48 would be a weak model: an error rate of 0.48 means that the model misclassifies 48% of the data-points. A decision stump is an example of a weak model. In boosting, we combine weak learners, typically high-bias models such as decision stumps, and turn them into a strong learner with low bias. This is achieved by a sequential algorithm that focuses on the mistakes committed at each step.

3.2.1. AdaBoost
AdaBoost($D$, weak-learner):
• $D_0(i) = \frac{1}{n}$, for $1 \leqslant i \leqslant n$
• Repeat for $T$ rounds ($t = 1, \dots, T$):
– $h_t \leftarrow$ train a weak learner on $(D, D_{t-1})$
– $\epsilon_t = \mathrm{error}(h_t) = \sum_{h_t(x_i) \neq y_i} D_{t-1}(i)$
– $\alpha_t = \frac{1}{2} \ln\!\left(\dfrac{1 - \epsilon_t}{\epsilon_t}\right)$
– $D_t(i) = \begin{cases} e^{\alpha_t} \cdot D_{t-1}(i), & h_t(x_i) \neq y_i \ \text{(mistake)} \\ e^{-\alpha_t} \cdot D_{t-1}(i), & h_t(x_i) = y_i \ \text{(correct)} \end{cases}$
– Normalize: $D_t(i) = \dfrac{D_t(i)}{\sum_{j=1}^{n} D_t(j)}$
• Return $\mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t \cdot h_t\right)$
Observations:
• Mistakes are boosted by a factor of $e^{\alpha_t}$ and correctly classified data-points are scaled down by $e^{-\alpha_t}$.
• If $\epsilon_t$ is high, then $\alpha_t$ is low. That is, if the classifier in round $t$ has a higher error, it is assigned a lower weight in the final ensemble.

It can be shown that as the number of rounds increases, the training error keeps decreasing. The training error of the final classifier after $T$ rounds is at most:
$$\exp\left(-2 \sum_{t=1}^{T} \gamma_t^2\right)$$
where $\gamma_t = \frac{1}{2} - \epsilon_t$.
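To close, here is a from-scratch sketch of the algorithm, assuming scikit-learn is available for the decision stumps and labels in $\{-1, +1\}$; the function names adaboost and adaboost_predict are made up for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T):
    """AdaBoost with decision stumps as the weak learner; labels y must be -1/+1."""
    n = len(X)
    D = np.full(n, 1.0 / n)                                   # D_0(i) = 1/n
    stumps, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.clip(D[pred != y].sum(), 1e-10, 1 - 1e-10)   # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)                 # weight of this round's stump
        D = D * np.exp(-alpha * y * pred)   # e^{+alpha} on mistakes, e^{-alpha} on correct points
        D = D / D.sum()                     # normalize back to a distribution
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    return np.sign(sum(a * h.predict(X) for a, h in zip(alphas, stumps)))
```

The single-line update $D_{t-1}(i)\,e^{-\alpha_t y_i h_t(x_i)}$ used in the code is exactly the mistake/correct case split from the algorithm, since $y_i h_t(x_i)$ is $-1$ on a mistake and $+1$ otherwise.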