Contents

1. Margin
2. Hard-Margin Support Vector Machines
3. Solving the Optimization Problem
   3.1. Primal
   3.2. Dual
   3.3. Solution
   3.4. Support Vectors
      3.4.1. Primal View
      3.4.2. Dual View
4. Prediction
5. Example
6. Kernel SVM

1. Margin

Classifiers with a larger margin generalize better. That is, they perform better on test data-points compared to classifiers that have a small margin. Here is a visual intuition to justify this:

[Figure: two linear classifiers separating green and red points, one with a small margin and one with a large margin]

Given a linear classifier, let us make the notion of margin more precise. There are two kinds of margin. Consider a linearly separable dataset with a positive margin. Let $w$ be any valid classifier that perfectly separates the dataset. We can always find at least one point from the dataset that is closest to the decision boundary. Call this point $x^*$. For convenience, let this point have $y = 1$. This point will lie on some line parallel to the decision boundary, say $w^T x = \gamma$. This $\gamma$ is termed the functional margin. From our assumption, we have:

$$(w^T x_i)\, y_i \geqslant \gamma, \quad 1 \leqslant i \leqslant n$$

Visually:

[Figure: the point $x^*$ on the line $w^T x = \gamma$, at a distance of $\gamma / \|w\|$ from the decision boundary $w^T x = 0$]

The distance between the lines $w^T x = 0$ and $w^T x = \gamma$ is called the geometric margin and is given by:

$$\frac{w^T x^*}{\|w\|} = \frac{\gamma}{\|w\|}$$

Since both $w$ and $\gamma$ can be scaled simultaneously without affecting the setup, we will set $\gamma = 1$. The corresponding figure becomes:

[Figure: the point $x^*$ on the line $w^T x = 1$, at a distance of $1 / \|w\|$ from the decision boundary $w^T x = 0$]

Therefore, for any linearly separable dataset with a positive margin and a valid classifier $w$, we can set the functional margin to $\gamma = 1$, and the geometric margin becomes:

$$\text{(geometric) margin} = \frac{1}{\|w\|}$$

The term margin is often used to denote both this distance and the hyperplanes $w^T x = \pm 1$. The context will determine which is meant.
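As a concrete check of these definitions, here is a minimal sketch, not from the original notes, that computes the functional and geometric margins of a given separator. The arrays `X`, `y`, and `w` below are illustrative choices made for this write-up:

```python
import numpy as np

# Illustrative data: columns of X are the points x_i, y holds the labels +1/-1.
X = np.array([[2.0, 0.0, 3.0, -2.0,  0.0],
              [0.0, 2.0, 3.0,  0.0, -2.0]])   # shape (d, n)
y = np.array([1, 1, 1, -1, -1])
w = np.array([1.0, 1.0])                       # some valid separating classifier

scores = y * (w @ X)                           # (w^T x_i) y_i for every point
gamma = scores.min()                           # functional margin: smallest (w^T x_i) y_i
geometric_margin = gamma / np.linalg.norm(w)   # distance from w^T x = 0 to w^T x = gamma

print(gamma, geometric_margin)                 # here: 2.0 and sqrt(2)
```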
2. Hard-Margin Support Vector Machines

Now that the (geometric) margin is defined, we would like to find a classifier that has the maximum margin:

$$\max_{w} \ \frac{1}{\|w\|}$$

But we also have constraints:

[Figure: the decision boundary $w^T x = 0$ with the hyperplanes $w^T x = 1$ and $w^T x = -1$; green points have label $1$, red points have label $-1$]

Since the closest points to the classifier lie on the margin, all other points have to lie beyond the margin, giving us the following $n$ constraints:

$$(w^T x_i)\, y_i \geqslant 1, \quad 1 \leqslant i \leqslant n$$

Putting these two together, we have:

$$\max_{w} \ \frac{1}{\|w\|} \quad \text{subject to} \quad (w^T x_i)\, y_i \geqslant 1, \ 1 \leqslant i \leqslant n$$

A slight modification to the objective function gives an equivalent problem: maximizing $1/\|w\|$ is the same as minimizing $\|w\|$, and hence the same as minimizing $\|w\|^2/2$, which has the advantage of being differentiable everywhere:

$$\min_{w} \ \frac{\|w\|^2}{2} \quad \text{subject to} \quad (w^T x_i)\, y_i \geqslant 1, \ 1 \leqslant i \leqslant n$$
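For concreteness, this optimization problem can be handed to an off-the-shelf convex solver. The sketch below is an assumption of this write-up, not part of the original notes; it uses the CVXPY library and the illustrative dataset from the previous sketch:

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 0.0, 3.0, -2.0,  0.0],
              [0.0, 2.0, 3.0,  0.0, -2.0]])   # d x n data-matrix, columns are points
y = np.array([1, 1, 1, -1, -1])

d, n = X.shape
w = cp.Variable(d)

# min ||w||^2 / 2   subject to   (w^T x_i) y_i >= 1 for every i
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X.T @ w) >= 1])
problem.solve()

print("w* =", w.value)                          # for this data: roughly [0.5, 0.5]
print("margin =", 1 / np.linalg.norm(w.value))  # 1 / ||w*||
```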
3. Solving the Optimization Problem

3.1. Primal

Since the objective function is convex and the inequality constraints involve convex functions, the entire problem is a convex optimization problem. We can recast it in the standard form as:

$$\min_{w} \ \frac{\|w\|^2}{2} \quad \text{subject to} \quad 1 - (w^T x_i)\, y_i \leqslant 0, \ 1 \leqslant i \leqslant n$$

We will call this the primal form. In the primal form, the variable we are optimizing over is $w$, and the constraints are called the primal constraints. There are $n$ constraints, one corresponding to each data-point. We now introduce the Lagrangian function, with $\alpha_i$ being the $i^{\text{th}}$ Lagrange multiplier:

$$\frac{\|w\|^2}{2} + \sum_{i=1}^{n} \alpha_i \left[ 1 - (w^T x_i)\, y_i \right]$$

We can equivalently express the original optimization problem as follows:

$$\min_{w} \ \max_{\alpha \geqslant 0} \ \frac{\|w\|^2}{2} + \sum_{i=1}^{n} \alpha_i \left[ 1 - (w^T x_i)\, y_i \right]$$
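Why is this min-max form equivalent to the constrained primal? The following short argument, added here for completeness, is the standard one: the inner maximization acts as a barrier for the constraints.

$$
\max_{\alpha \geqslant 0} \ \frac{\|w\|^2}{2} + \sum_{i=1}^{n} \alpha_i \left[ 1 - (w^T x_i)\, y_i \right]
= \begin{cases}
\dfrac{\|w\|^2}{2}, & \text{if } (w^T x_i)\, y_i \geqslant 1 \ \text{for all } i \\[4pt]
+\infty, & \text{otherwise}
\end{cases}
$$

If some constraint is violated, the corresponding bracket is positive and the inner maximum is $+\infty$ (let that $\alpha_i \to \infty$); if all constraints hold, every bracket is non-positive and the best choice is $\alpha_i = 0$. Minimizing over $w$ therefore recovers exactly the primal problem.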
3.2. Dual

The dual problem can be expressed as:

$$\max_{\alpha \geqslant 0} \ \min_{w} \ \frac{\|w\|^2}{2} + \sum_{i=1}^{n} \alpha_i \left[ 1 - (w^T x_i)\, y_i \right]$$

The dual objective is itself an unconstrained optimization problem, which we will solve now:

$$\min_{w} \ \frac{\|w\|^2}{2} + \sum_{i=1}^{n} \alpha_i \left[ 1 - (w^T x_i)\, y_i \right]$$

Setting the gradient with respect to $w$ to zero,

$$w - \sum_{i=1}^{n} \alpha_i x_i y_i = 0,$$

we get the weight vector to be a linear combination of the data-points:

$$w = \sum_{i=1}^{n} \alpha_i x_i y_i$$

In matrix notation:

$$w = X Y \alpha$$

Here $X$ is the $d \times n$ data-matrix, $Y$ is an $n \times n$ diagonal matrix whose diagonal elements are the labels of the $n$ data-points, and $\alpha$ is an $n \times 1$ vector of the Lagrange multipliers:

$$
X = \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix}, \quad
Y = \begin{bmatrix} y_1 & & \\ & \ddots & \\ & & y_n \end{bmatrix}, \quad
\alpha = \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{bmatrix}
$$

Plugging $w = X Y \alpha$ back into the dual objective, and noting that $\|X Y \alpha\|^2 = \alpha^T Y^T X^T X Y \alpha$, we get the following form:

$$\sum_{i=1}^{n} \alpha_i - \frac{1}{2}\, \alpha^T \left( Y^T X^T X Y \right) \alpha$$

The dual optimization problem therefore becomes:
$$\max_{\alpha \geqslant 0} \ \alpha^T \mathbf{1} - \frac{1}{2}\, \alpha^T \left( Y^T X^T X Y \right) \alpha$$

Here, $\mathbf{1}$ represents a vector of $n$ ones. The objective function is a quadratic function of $\alpha$ and is concave.
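Since the dual is a concave quadratic maximized over the non-negative orthant, any off-the-shelf constrained optimizer can handle it. The following sketch is only an illustration added to these notes; it assumes NumPy and SciPy are available and reuses the illustrative dataset from the earlier sketches. In practice a dedicated QP solver would be preferred.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 0.0, 3.0, -2.0,  0.0],
              [0.0, 2.0, 3.0,  0.0, -2.0]])     # d x n
y = np.array([1, 1, 1, -1, -1])
n = X.shape[1]

Y = np.diag(y)
Q = Y @ X.T @ X @ Y                              # Y^T X^T X Y (Y is symmetric)

# Negate the concave dual objective so a minimizer can be used.
def neg_dual(alpha):
    return -(alpha.sum() - 0.5 * alpha @ Q @ alpha)

res = minimize(neg_dual, x0=np.zeros(n), bounds=[(0, None)] * n)
alpha_star = res.x
w_star = X @ Y @ alpha_star                      # w = X Y alpha

# The recovered w should match the primal solution from the earlier sketch.
print(alpha_star.round(3), w_star.round(3))
```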
3.3. Solution

For a convex optimization problem, under certain conditions, the primal and the dual have the same optimal value. Solving the dual problem is preferable for the following reasons:
• The constraints are simpler.
• The appearance of $X^T X$ points to kernels.
We don't discuss the solution method here. It is enough to know that a solution exists. If the optimal dual variable is denoted by $\alpha^*$ and the optimal primal variable is denoted by $w^*$, the two are related as:

$$w^* = X Y \alpha^*$$

The schematic diagram below summarizes the discussion so far.

[Figure: schematic of the pipeline. The training data is fed to an SVM solver, which outputs $\alpha^* = (\alpha_1^*, \ldots, \alpha_n^*)^T$; from it, $w^* = \sum_{i=1}^{n} \alpha_i^* x_i y_i$ is formed; the decision boundary $(w^*)^T x = 0$ sits between the hyperplanes $(w^*)^T x = 1$ and $(w^*)^T x = -1$, and predictions are $y = 1$ if $(w^*)^T x \geqslant 0$ and $y = -1$ otherwise]

$(w^*)^T x = 0$ is the decision boundary. $(w^*)^T x = \pm 1$ are called the supporting hyperplanes. We will see why this name is used in the next section.

3.4. Support Vectors

The complementary slackness conditions enforce the following equations:

$$\alpha_i^* \left[ 1 - \left( (w^*)^T x_i \right) y_i \right] = 0, \quad 1 \leqslant i \leqslant n$$

3.4.1. Primal View

• If $\left((w^*)^T x_i\right) y_i > 1$, we have $1 - \left((w^*)^T x_i\right) y_i < 0$, which forces $\alpha_i^* = 0$. For a point which lies beyond the supporting hyperplanes, the value of $\alpha_i^*$ is zero.
• If $\left((w^*)^T x_i\right) y_i = 1$, we have $1 - \left((w^*)^T x_i\right) y_i = 0$, which doesn't really force $\alpha_i^*$ to be a particular value. So we just settle for $\alpha_i^* \geqslant 0$. For a point which lies on the supporting hyperplanes, we can't comment on the value of $\alpha_i^*$. It can be any non-negative quantity.

[Figure: points beyond the supporting hyperplane, with $\left((w^*)^T x_i\right) y_i > 1$, have $\alpha_i^* = 0$; points on the supporting hyperplane, with $\left((w^*)^T x_i\right) y_i = 1$, have $\alpha_i^* \geqslant 0$]

3.4.2. Dual View

• If $\alpha_i^* = 0$, we can't comment on the position of the point. It could lie either on the supporting hyperplane or beyond it.
• If $\alpha_i^* > 0$, we have $1 - \left((w^*)^T x_i\right) y_i = 0 \implies \left((w^*)^T x_i\right) y_i = 1$. A point with $\alpha_i^* > 0$ is going to lie on one of the supporting hyperplanes.

[Figure: a point with $\alpha_i^* = 0$ satisfies $\left((w^*)^T x_i\right) y_i \geqslant 1$ and may lie on or beyond the supporting hyperplane; a point with $\alpha_i^* > 0$ lies exactly on a supporting hyperplane, where $\left((w^*)^T x_i\right) y_i = 1$]
The points for which $\alpha_i^* > 0$ are called support vectors.
Some observations concerning support vectors:
• All support vectors lie on the supporting hyperplanes.
• But note that every point on the supporting hyperplanes need not be a support vector.
• The weight vector can now be seen as a sparse linear combination of the data-points, as shown in the sketch below. The sparsity comes from the fact that all non-support vectors have $\alpha_i^* = 0$.
• The support vectors are the most important training data-points, as they lend their support in building the decision boundary.
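To see this sparsity concretely, here is a small sketch continuing with the illustrative dataset from the earlier sketches, together with one possible set of optimal multipliers for it; the tolerance is needed because numerical solvers return tiny values rather than exact zeros:

```python
import numpy as np

X = np.array([[2.0, 0.0, 3.0, -2.0,  0.0],
              [0.0, 2.0, 3.0,  0.0, -2.0]])               # d x n
y = np.array([1, 1, 1, -1, -1])
alpha_star = np.array([0.125, 0.125, 0.0, 0.125, 0.125])   # one optimal dual solution

tol = 1e-8
support = np.where(alpha_star > tol)[0]      # indices with alpha_i^* > 0
print("support vectors:", support)           # the third point lies beyond the margin, so it is excluded

# w* is a sparse combination: only the support vectors contribute.
w_star = X[:, support] @ (alpha_star[support] * y[support])
print("w* =", w_star)                        # [0.5, 0.5]
```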
4. Prediction

Once an SVM has been trained, it can be used to predict the label of a test point like any other linear classifier:

$$y = \begin{cases} 1, & (w^*)^T x_{\text{test}} \geqslant 0 \\ -1, & (w^*)^T x_{\text{test}} < 0 \end{cases}$$

Note that the constraint $\left((w^*)^T x_i\right) y_i \geqslant 1$ is binding only on the training data-points.
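A minimal prediction routine, assuming a `w_star` obtained from a solver as in the earlier sketches:

```python
import numpy as np

def predict(w_star, x_test):
    """Hard-margin SVM prediction: +1 if (w*)^T x_test >= 0, else -1."""
    return 1 if w_star @ x_test >= 0 else -1

w_star = np.array([0.5, 0.5])                   # illustrative solution from before
print(predict(w_star, np.array([1.0, 2.0])))    # -> 1
print(predict(w_star, np.array([-3.0, 0.5])))   # -> -1
```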
5. Example

Consider the following dataset:

$$
X = \begin{bmatrix} 1 & 0 & 3 & -1 & 0 & -3 \\ 0 & -1 & 2 & 0 & 1 & -2 \end{bmatrix}, \quad
y = \begin{bmatrix} 1 & 1 & 1 & -1 & -1 & -1 \end{bmatrix}^T
$$

[Figure: the six data-points plotted in the feature space, with axes $x_1$ and $x_2$]

The optimization problem is:
$$\min_{w} \ \frac{w_1^2 + w_2^2}{2}$$

subject to:

$$\begin{aligned}
w_1 &\geqslant 1 \quad && (1) \\
w_2 &\leqslant -1 \quad && (2) \\
3 w_1 + 2 w_2 &\geqslant 1 \quad && (3) \\
w_1 &\geqslant 1 \quad && (4) \\
w_2 &\leqslant -1 \quad && (5) \\
3 w_1 + 2 w_2 &\geqslant 1 \quad && (6)
\end{aligned}$$

On simplifying the constraints and plotting them along with the contours of the objective:

[Figure: the parameter space with axes $w_1$ and $w_2$, showing the feasible region and the optimum]

We see that the primal solution is $w^* = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$. We can also solve for the dual. One possible solution is $\alpha^* = \begin{bmatrix} 0 & 1 & 0 & 0 & 2/3 & 1/3 \end{bmatrix}^T$.

[Figure: the data-points in feature space with the decision boundary $(w^*)^T x = 0$ and the supporting hyperplanes $(w^*)^T x = \pm 1$; each support vector is annotated with its multiplier $\alpha_i^*$]

As a check, the primal solution can be recovered from the dual one:

$$w^* = \sum_{i=1}^{6} \alpha_i^* x_i y_i = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$

This is an example of a linear hard-margin SVM problem.
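The example above can be checked numerically. The sketch below, again using CVXPY (an assumption of this write-up, not part of the original notes), solves the primal for this dataset and reads off the Lagrange multipliers from the constraint's dual values. The recovered $w^*$ should be $(1, -1)$, while the multipliers may differ from the ones quoted above, since the dual solution is not unique here:

```python
import cvxpy as cp
import numpy as np

X = np.array([[1, 0, 3, -1, 0, -3],
              [0, -1, 2, 0, 1, -2]], dtype=float)   # d x n, the example dataset
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

w = cp.Variable(2)
constraints = [cp.multiply(y, X.T @ w) >= 1]
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()

print("w* =", w.value.round(3))                       # expected: [ 1. -1.]
print("alpha* =", constraints[0].dual_value.round(3)) # one valid set of multipliers
```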
6. Kernel SVM

We have the following correspondence between linear SVM and kernel SVM. $\phi$ is typically a feature map that cannot be explicitly computed.

| Linear SVM | Kernel SVM |
|---|---|
| $x_i$ | $\phi(x_i)$ |
| $X$ | $\phi(X)$ |
| $X^T X$ | $K$ |
| $\displaystyle\max_{\alpha \geqslant 0} \ \alpha^T \mathbf{1} - \frac{1}{2}\, \alpha^T \left( Y^T X^T X Y \right) \alpha$ | $\displaystyle\max_{\alpha \geqslant 0} \ \alpha^T \mathbf{1} - \frac{1}{2}\, \alpha^T \left( Y^T K Y \right) \alpha$ |
| $w^* = X Y \alpha^*$ | $w^* = \phi(X) Y \alpha^*$ |
| $y = \begin{cases} 1, & (w^*)^T x_{\text{test}} \geqslant 0 \\ -1, & \text{otherwise} \end{cases}$ | $y = \begin{cases} 1, & \sum_{i=1}^{n} \alpha_i^* y_i\, k(x_i, x_{\text{test}}) \geqslant 0 \\ -1, & \text{otherwise} \end{cases}$ |
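To make the prediction rule in the right-hand column concrete, here is a minimal sketch of kernel prediction, added to these notes. The specific kernel $k(x, z) = (1 + x^T z)^2$ is an assumption chosen for illustration; given the training points, labels, and multipliers from a kernel SVM solver, the label of a test point is obtained without ever forming $\phi$ explicitly:

```python
import numpy as np

def quadratic_kernel(x, z):
    # Illustrative degree-2 polynomial kernel; its feature map phi never has to be formed.
    return (1.0 + x @ z) ** 2

def kernel_predict(X, y, alpha_star, x_test, kernel=quadratic_kernel):
    """Label of x_test: sign of sum_i alpha_i^* y_i k(x_i, x_test).

    X is the d x n data-matrix (columns are training points), y holds the
    labels, and alpha_star holds the optimal dual variables.
    """
    score = sum(a * yi * kernel(xi, x_test)
                for a, yi, xi in zip(alpha_star, y, X.T))
    return 1 if score >= 0 else -1
```

Only the support vectors, those with $\alpha_i^* > 0$, contribute to the sum, so in practice the loop can be restricted to them.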
Below is an example of a dataset that is not linearly separable in the original feature space, but can be separated after a non-linear transformation. A quadratic kernel has been used here. The decision boundary and the two supporting hyperplanes (when brought back to the original feature space) are also displayed:

[Figure: concentric data in the $(x_1, x_2)$ plane; the decision boundary is the circle $x_1^2 + x_2^2 = 8.5$ and the supporting curves are the circles $x_1^2 + x_2^2 = 1$ and $x_1^2 + x_2^2 = 16$]

Note that the decision boundary is not equidistant from the two supporting curves. This is because the actual optimization happens in the transformed space ($\mathbb{R}^6$ in this example) and we are visualizing the result in the original space. In the transformed space, the corresponding supporting hyperplanes are indeed equidistant from the boundary.