$$\frac{1}{n}\sum_{i=1}^{n} I[y_i \neq f(x_i)]$$
This is the average number of errors made by the classifier, i.e., the fraction of misclassified points.

2. Linear Classifier
$$f: \mathbb{R}^d \to \{1, -1\}, \qquad f(x) = \text{sign}(w^T x) = \begin{cases} 1, & w^T x \geq 0 \\ -1, & w^T x < 0 \end{cases}$$
Geometrically, the hyperplane $w^T x = 0$ splits the input space into the two half-planes $\{x \mid w^T x \geq 0\}$ and $\{x \mid w^T x < 0\}$, which receive the labels $1$ and $-1$ respectively.
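A minimal sketch of this classifier and its misclassification rate (function and array names are hypothetical, not from the notes):

```python
import numpy as np

def predict(w, X):
    """f(x) = sign(w^T x), with sign(0) taken as +1 as in the definition above."""
    return np.where(X @ w >= 0, 1, -1)

def misclassification_rate(w, X, y):
    """(1/n) * sum_i I[y_i != f(x_i)]: the fraction of misclassified points."""
    return np.mean(predict(w, X) != y)

# toy usage: two points on either side of the line x1 + x2 = 0
w = np.array([1.0, 1.0])
X = np.array([[2.0, 1.0], [-1.0, -3.0]])
y = np.array([1, -1])
print(misclassification_rate(w, X, y))   # 0.0
```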
The following optimization problem, which directly minimizes this error, is NP-hard:
$$\min_{w} \; \frac{1}{n}\sum_{i=1}^{n} I[y_i \neq \text{sign}(w^T x_i)]$$
Using the SSE for classification is not a good idea either; the SSE is sensitive to outliers:
$$\min_{w} \; \frac{1}{n}\sum_{i=1}^{n} (w^T x_i - y_i)^2$$
A point that lies far on the correct side of the boundary still produces a large residual $(w^T x_i - y_i)^2$, so a few such outliers can drag the fitted boundary away from where the misclassification error would place it.
Besides, modeling classification as a regression problem has another issue: there is no natural ordering among the labels in a classification problem. For example, if the labels are "cat", "dog", and "mouse", all permutations of the labels {1, 2, 3} are equally valid encodings. A regression-based approach, however, would take the ordering of {1, 2, 3} into account: there is a definite sense in which 3 > 2 > 1 and 3 - 2 = 2 - 1 = 1, but no such order exists among the labels "cat", "dog", and "mouse".

3. KNN
Prediction for a given test point (a code sketch follows the hyperparameter list below):
• Find $d(x_{\text{test}}, x_i)$ for all $i$
• Sort the distances in ascending order
• Find the labels of the first $k$ points
• Return the most frequently occurring label
Usually an odd number is chosen for $k$ to avoid ties.
Hyperparameters:
• $k$
• distance metric
  – L2
  – L1
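A minimal sketch of this prediction procedure using the L2 distance (function and variable names are hypothetical):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Majority vote among the k nearest training points (L2 distance)."""
    dists = np.linalg.norm(X_train - x_test, axis=1)  # d(x_test, x_i) for all i
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    votes = Counter(y_train[nearest])                 # labels of those k points
    return votes.most_common(1)[0][0]                 # most frequently occurring label

# toy usage
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([-1, -1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.1, 0.0]), k=3))   # -1
```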
Figure 1: An example dataset and the KNN prediction for k = 3. Source: Pg 36, ISLP
Effect of k on the decision boundary
As k increases, the model becomes less flexible.
Figure 2: Source: Pg 39, ISLP
Figure 3: Source: Pg 39, ISLP
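One way to see the effect of k numerically (a sketch using scikit-learn; the synthetic two-moons dataset and the particular values of k are arbitrary choices):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 15, 100):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # small k: very flexible, jagged boundary (low training error, tends to overfit)
    # large k: smoother, less flexible boundary
    print(f"k={k:3d}  train acc={model.score(X_tr, y_tr):.2f}"
          f"  test acc={model.score(X_te, y_te):.2f}")
```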
Advantages
• Very easy to implement
• Interpretable
  – Increases trust in the model
  – Easy to explain to non-experts

Disadvantages
• Computationally expensive
  – With $n$ points in the training data, $n$ distances have to be computed for each prediction
  – These $n$ distances then have to be sorted
• No model is learnt
• The training data has to be stored even during prediction
• Curse of dimensionality

Curse of Dimensionality
As the number of dimensions increases ($\mathbb{R}, \mathbb{R}^2, \mathbb{R}^3, \dots$), neighborhood information becomes less reliable: a fixed amount of data becomes sparse, and the "nearest" neighbours of a point are barely closer than a typical point.
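A small numerical illustration of this effect (a sketch; the uniformly random data and the sample size are arbitrary choices). For points drawn uniformly in $[0, 1]^d$, the mean distance to the nearest neighbour approaches the mean pairwise distance as $d$ grows:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n = 500
for d in (1, 2, 10, 100, 500):
    X = rng.random((n, d))           # n points uniform in the unit hypercube [0, 1]^d
    dist = cdist(X, X)               # all pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)   # ignore self-distances
    nearest = dist.min(axis=1).mean()          # mean nearest-neighbour distance
    average = dist[np.isfinite(dist)].mean()   # mean pairwise distance
    # the ratio climbs towards 1: in high dimensions the nearest neighbour
    # is barely closer than a typical point
    print(f"d={d:4d}  nearest / average distance = {nearest / average:.2f}")
```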
4. Decision Trees

4.1. Binary tree
• Each question $Q_i$ is of the form feature < value
• Depth = 3
• Nodes
  – Root node: $Q_1$
  – Internal nodes: $Q_2, Q_3, Q_4$
  – Leaves: $L_1, L_2, L_3, L_4, L_5$
[Tree diagram: root $Q_1$, internal nodes $Q_2, Q_3, Q_4$, leaves $L_1$ to $L_5$; each question node has a Y branch and an N branch.]
The tree is the model. Once the tree has been grown from the data, the data can be thrown away (a sketch of a tree stored this way is given below).
Depth: the number of edges on the longest path from the root to a leaf.
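To make "the tree is the model" concrete, a minimal sketch (class and field names are hypothetical) of a grown tree stored without the training data; prediction traverses it until a leaf is reached, as described later under Prediction:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None    # index f of the feature in the question "x[f] < value?"
    value: Optional[float] = None    # threshold used in the question
    yes: Optional["Node"] = None     # subtree followed when the answer is Y
    no: Optional["Node"] = None      # subtree followed when the answer is N
    label: Optional[int] = None      # set only at leaves

def predict(node: Node, x) -> int:
    """Traverse the tree until a leaf is reached and output its label."""
    while node.label is None:
        node = node.yes if x[node.feature] < node.value else node.no
    return node.label

# toy usage: a depth-1 tree (a stump) asking "x1 < 5.5?"
tree = Node(feature=0, value=5.5, yes=Node(label=1), no=Node(label=0))
print(predict(tree, [3.0, 7.0]))   # 1, since x1 = 3.0 < 5.5
```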
4.2. Entropy: Node impurity
• $n_P$ → number of positive data-points in the node
• $n_N$ → number of negative data-points in the node
• $n = n_P + n_N$
• $p$ → proportion of positive (negative) data-points in the node, $p = \dfrac{n_P}{n}$
The entropy of a node is a measure of the node's impurity:
$$H(p) = -p\log_2 p - (1 - p)\log_2(1 - p)$$
[Plot: impurity $H(p)$ against $p$; it is 0 at $p = 0$ and $p = 1$ (pure nodes) and maximal at $p = 1/2$.]
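A direct translation of this formula (a sketch, using the convention $0 \log 0 = 0$):

```python
import math

def entropy(p):
    """H(p) = -p log2(p) - (1 - p) log2(1 - p); a pure node (p = 0 or 1) has entropy 0."""
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(entropy(0.5))    # 1.0   -> maximally impure node
print(entropy(0.25))   # ~0.811
print(entropy(1.0))    # 0.0   -> pure node
```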
4.3. Decision Stump
[Stump diagram: a parent node P whose question sends each point along the Y or the N branch, to the children L and R.]
• $D$: dataset at the parent
• $x_f < s$: the question
• $D_L$ and $D_R$: the resulting partitions of $D$
• $p_P, p_L, p_R$: proportions at $P$, $L$, $R$
• $\gamma$: proportion of points in $L$
• $E_P$: entropy of $P$
• $E_L$: entropy of $L$
• $E_R$: entropy of $R$
• IG: information gain
$$n = n_L + n_R, \qquad \gamma = \frac{n_L}{n}$$
$$E = -p\log_2 p - (1 - p)\log_2(1 - p)$$
$$IG = E_P - \left[\gamma E_L + (1 - \gamma) E_R\right]$$
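A sketch (helper names hypothetical) of the information gain of a stump, together with the greedy search over (feature, mid-point) questions that the tree-growing procedure in the next subsection performs at every node:

```python
import numpy as np

def entropy(p):
    # H(p) with the convention H(0) = H(1) = 0
    return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def information_gain(y_parent, y_left, y_right):
    """IG = E_P - [gamma * E_L + (1 - gamma) * E_R] for arrays of 0/1 labels."""
    gamma = len(y_left) / len(y_parent)
    e = lambda y: entropy(y.mean()) if len(y) else 0.0
    return e(y_parent) - (gamma * e(y_left) + (1 - gamma) * e(y_right))

def best_split(X, y):
    """Try every (feature, mid-point) question and return the one with maximum IG."""
    best_f, best_s, best_gain = None, None, -np.inf
    for f in range(X.shape[1]):
        values = np.unique(X[:, f])                 # sorted unique values of feature f
        for s in (values[:-1] + values[1:]) / 2:    # mid-points between consecutive values
            mask = X[:, f] < s                      # D_L: points answering Y
            gain = information_gain(y, y[mask], y[~mask])
            if gain > best_gain:
                best_f, best_s, best_gain = f, s, gain
    return best_f, best_s, best_gain

# toy usage: positives have small x1, negatives have large x1
X = np.array([[1.0, 7.0], [2.0, 6.0], [6.0, 7.0], [7.0, 2.0]])
y = np.array([1, 1, 0, 0])
print(best_split(X, y))   # (0, 4.0, 1.0): ask "x1 < 4.0?", gaining one full bit
```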
4.4. Growing a Tree
Growing a tree is a recursive and greedy algorithm:
• At each node, choose the (feature, value) pair that gives the greatest reduction in entropy, i.e., the maximum information gain.
• Choice of candidate values for a given feature:
  – Sort all the values this feature takes in the training dataset
    * Use this set of values (OR)
    * Use the mid-points between consecutive feature values in this set
• Stop growing the tree when the stopping criterion is met
  – The default stopping criterion is a node that is completely pure
  – Other stopping criteria are discussed at the end
• Assign labels to the leaf nodes
  – The label of a leaf is the most frequently occurring label among the training points that reach it

[Figure: a sample dataset on an 8 × 8 grid of $(x_1, x_2)$ values, the tree grown on it (root question $x_2 < 5.5$, then $x_1 < 5.5$ on each branch, leaves labelled 0 or 1), and the resulting decision boundary.]

Decision Regions
• 4 regions
• 4 leaves
In general, the boundaries of the regions are parallel to the axes: intervals in $\mathbb{R}$, rectangles in $\mathbb{R}^2$, boxes in $\mathbb{R}^3$, and hyper-rectangles in $\mathbb{R}^d$.

Prediction
Traverse the tree, answering the question at each node. When a leaf node is reached, output the label of that leaf node.

Overfitting
Deep trees have high variance.
[Figure: a deeper tree on the same dataset, with questions such as $x_1 < 3.5$, $x_2 < 5.5$, $x_2 < 2.5$, $x_1 < 2.5$, and $x_1 < 5.5$, and its decision boundary.]
• 7 regions
• 7 leaves

Mitigation (a short usage sketch of these options follows the references)
• Minimum number of samples at a leaf node
• Maximum depth
• Minimum decrease in impurity

Advantages
• Shallow trees are interpretable
• Easy to implement

Disadvantages
• A simple model with low predictive power
• Deep trees have high variance
  – Small changes to the training data result in big changes to the resulting tree

5. References
• ISLP: An Introduction to Statistical Learning with Applications in Python (James, Witten, Hastie, Tibshirani, Taylor)
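As a closing usage sketch of the mitigation options above (the synthetic dataset and the particular values are arbitrary choices), scikit-learn's DecisionTreeClassifier exposes them as min_samples_leaf, max_depth, and min_impurity_decrease:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# default: grow until the leaves are pure (deep tree, high variance)
deep = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_tr, y_tr)

# regularized: the three mitigation options listed above
shallow = DecisionTreeClassifier(
    criterion="entropy",
    min_samples_leaf=10,         # minimum samples at a leaf node
    max_depth=4,                 # maximum depth
    min_impurity_decrease=0.01,  # minimum decrease in impurity to split
    random_state=0,
).fit(X_tr, y_tr)

for name, model in [("deep", deep), ("shallow", shallow)]:
    print(f"{name:8s} depth={model.get_depth():2d}"
          f"  train acc={model.score(X_tr, y_tr):.2f}"
          f"  test acc={model.score(X_te, y_te):.2f}")
```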