Week-7
Contents

1. Zero-One Error
2. Linear Classifier
3. KNN
4. Decision Trees
   4.1. Binary tree
   4.2. Entropy: Node impurity
   4.3. Decision Stump
   4.4. Growing a Tree
5. References

1. Zero-One Error

If $f: \mathbb{R}^d \to \{1, 0\}$:

$$L(D, f) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[y_i \neq f(x_i)]$$

This is the fraction of training points on which the classifier's prediction is wrong.
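As a quick illustration, here is a minimal NumPy sketch of the zero-one error; the function and argument names (`zero_one_error`, `y_true`, `y_pred`) are illustrative, not from the source.

```python
import numpy as np

def zero_one_error(y_true, y_pred):
    """Fraction of points where the predicted label differs from the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean(y_true != y_pred)

# 2 mistakes out of 5 points -> error of 0.4
print(zero_one_error([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))
```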
2. Linear Classifier

$f: \mathbb{R}^d \to \{1, -1\}$

$$f(x) = \operatorname{sign}(w^T x) = \begin{cases} 1, & w^T x \geq 0 \\ -1, & w^T x < 0 \end{cases}$$
The optimization problem given below is NP-hard.
$$\min_{w} \ \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[y_i \neq \operatorname{sign}(w^T x_i)]$$
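A minimal sketch of the linear classifier and of evaluating this objective for a given $w$; since the minimisation itself is NP-hard, the code only computes the objective for a candidate $w$, and the names (`predict`, `zero_one_objective`) are illustrative.

```python
import numpy as np

def predict(w, X):
    """f(x) = sign(w^T x), with sign(0) taken as +1 as in the definition above."""
    return np.where(X @ w >= 0, 1, -1)

def zero_one_objective(w, X, y):
    """(1/n) * number of points with y_i != sign(w^T x_i), for a candidate w."""
    return np.mean(predict(w, X) != y)

# Tiny made-up example
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -2.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
print(zero_one_objective(np.array([1.0, 0.5]), X, y))  # fraction misclassified
```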
Using the sum of squared errors (SSE) for classification is not a good idea: the SSE is sensitive to outliers.
$$\min_{w} \ \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2$$
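For contrast, a sketch of the squared-error objective on $\pm 1$ labels; `sse_objective` is an illustrative name, and the example only shows how a correctly classified but far-away point can dominate the sum.

```python
import numpy as np

def sse_objective(w, X, y):
    """(1/n) * sum of (w^T x_i - y_i)^2."""
    return np.mean((X @ w - y) ** 2)

# Both points are classified correctly (sign(w^T x) = +1), yet the far-away
# point contributes (10 - 1)^2 = 81 to the sum and dominates the objective.
X = np.array([[1.0], [10.0]])
y = np.array([1, 1])
print(sse_objective(np.array([1.0]), X, y))  # 40.5
```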
Besides, modeling classification as a regression problem has another issue: there is no natural ordering among the labels in a classification problem. For example, if the labels are "cat", "dog" and "mouse", all permutations of the label assignment $\{1, 2, 3\}$ are equally valid. A regression-based approach, however, would take the ordering of $\{1, 2, 3\}$ into account: there is a definite sense in which $3 > 2 > 1$ and $3 - 2 = 2 - 1 = 1$, yet no such order exists among the labels "cat", "dog" and "mouse".

3. KNN

Prediction for a given test point (see the sketch below):

- Find $d(x_{\text{test}}, x_i)$ for all $i$
- Sort the distances in ascending order
- Find the labels of the first $k$ points
- Return the most frequently occurring label

Usually an odd number is chosen for $k$ to avoid ties.

Hyperparameters:

- $k$
- distance metric ($L_2$ or $L_1$)
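A minimal sketch of the prediction procedure listed above, assuming the $L_2$ distance; the names `knn_predict`, `X_train` and `y_train` are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(x_test, X_train, y_train, k=3):
    """Predict the label of x_test by majority vote among its k nearest neighbours."""
    dists = np.linalg.norm(X_train - x_test, axis=1)  # L2 distance to every training point
    nearest = np.argsort(dists)[:k]                   # indices of the k smallest distances
    return Counter(y_train[nearest]).most_common(1)[0][0]  # most frequent label

# Tiny made-up example
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 1, 1, 1])
print(knn_predict(np.array([4.5, 5.0]), X_train, y_train, k=3))  # -> 1
```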
Figure 1: Dataset (Source: Pg 36, ISLP)
Effect of k on the decision boundary

As $k$ increases, the model becomes less flexible.
Figure 2: Source: Pg 39, ISLP

Figure 3: Source: Pg 39, ISLP
Advantages:

- Very easy to implement
- Interpretable
  * Increases trust in the model
  * Easy to explain to non-experts

Disadvantages:

- Computationally expensive: with $n$ points in the training data, $n$ distances have to be computed for each prediction and then sorted
- No model is learnt: the training data has to be stored even during prediction
- Curse of dimensionality

Curse of Dimensionality

As dimensions increase, neighborhood information becomes less reliable.

4. Decision Trees

4.1. Binary tree

Each internal node asks a question of the form $Q_i: \text{feature} < \text{value}$. An example tree of depth 3 has:

- Root node: $Q_1$
- Internal nodes: $Q_2, Q_3, Q_4$
- Leaves: $L_1, L_2, L_3, L_4, L_5$

The tree is the model: once the tree has been grown from the data, the data can be thrown away (see the sketch below).
Depth: The number of edges on the longest path from the root to a leaf.
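A minimal sketch of how such a tree could be represented once grown, with prediction by traversal; the `Node` class and its field names are illustrative, not from the source.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """A node of a binary decision tree.

    Internal nodes ask a question of the form 'feature < value';
    leaves carry a label and have no children.
    """
    feature: int = 0
    value: float = 0.0
    left: Optional["Node"] = None   # branch taken when x[feature] < value
    right: Optional["Node"] = None  # branch taken otherwise
    label: Optional[int] = None     # set only for leaf nodes

def predict(node: Node, x) -> int:
    """Traverse the tree until a leaf is reached and return its label."""
    while node.label is None:
        node = node.left if x[node.feature] < node.value else node.right
    return node.label

# A depth-1 example: Q1 asks 'x[0] < 2.0'
tree = Node(feature=0, value=2.0, left=Node(label=1), right=Node(label=-1))
print(predict(tree, [1.5]))  # -> 1
```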
4.2. Entropy: Node impurity

- $n_P$ → number of positive data-points
- $n_N$ → number of negative data-points
- $n = n_P + n_N$
- $p$ → proportion of positive (negative) data-points in a node:

$$p = \frac{n_P}{n}$$
The entropy of a node is a measure of the node's impurity:

$$H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$
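A minimal sketch of the entropy computation; a helper for the information gain of a split is included, since the decision-stump subsection below scores a split with exactly this quantity. The function names are illustrative.

```python
import numpy as np

def entropy(p):
    """H(p) = -p*log2(p) - (1 - p)*log2(1 - p); a pure node (p = 0 or 1) has entropy 0."""
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def information_gain(e_parent, e_left, e_right, n_left, n_right):
    """IG = E_P - [gamma*E_L + (1 - gamma)*E_R], with gamma = n_L / n."""
    gamma = n_left / (n_left + n_right)
    return e_parent - (gamma * e_left + (1 - gamma) * e_right)

# A 50-50 parent split into two pure children gives a gain of 1 bit
print(information_gain(entropy(0.5), entropy(0.0), entropy(1.0), 5, 5))  # -> 1.0
```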
4.3. Decision Stump

A decision stump splits the $n$ points at the parent node into a left child with $n_L$ points and a right child with $n_R$ points:

$$n = n_L + n_R, \qquad \gamma = \frac{n_L}{n}$$

The entropy of each node is

$$E = -p \log_2 p - (1 - p) \log_2 (1 - p)$$

and the information gain of the split is

$$IG = E_P - [\gamma E_L + (1 - \gamma) E_R]$$

where $E_P$, $E_L$ and $E_R$ are the entropies of the parent, left child and right child respectively.

4.4. Growing a Tree

Growing a tree is a recursive and greedy algorithm:

- At each node, choose the (feature, value) pair that gives the greatest reduction in entropy, i.e. the maximum information gain.
- Choice of values for a given feature: sort all the values of this feature in the training dataset, then either
  * use this set of values, or
  * use the mid-points between consecutive feature values in this set.
- Stop growing the tree when the stopping criterion is met.
  * The default stopping criterion is a completely pure node.
  * Other stopping criteria are discussed at the end.
- Assign labels to leaf nodes: the label of a leaf is the most frequently occurring label among the training points that reach it.

(Figures: sample dataset, the fitted tree and the resulting decision boundary.)

Decision Regions

In general, the boundaries of the regions are parallel to the axes.

Prediction

Traverse the tree; when a leaf node is reached, output the label of that leaf node.

Overfitting

Deep trees have high variance.

Mitigation:

- Minimum samples at a leaf node
- Maximum depth
- Minimum decrease in impurity

Advantages:

- Shallow trees are interpretable
- Easy to implement

Disadvantages:

- Simple model with low predictive power
- Deep trees have high variance: small changes to the training data result in big changes in the resulting tree

5. References

ISLP: An Introduction to Statistical Learning with Applications in Python (James, Witten, Hastie, Tibshirani, Taylor)