CS460 Lecture note


Decision Tree (Gini Impurity)


Deepak Kumar

Decision Tree

Decision Trees are a non-parametric (meaning there are no underlying assumptions about the distribution of the data, so the model is constructed purely from the observed data) supervised learning (meaning labels are provided to the model for learning) method used for classification (if the target variable is discrete) and regression (if the target variable is continuous). In the simplest terms, this model predicts the value of a target variable by learning simple decision rules inferred from the data features. A decision tree is structured so that each internal node denotes a test on a feature, each branch represents an outcome of that test, and each leaf node (terminal node) holds a class label.
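
To make this concrete, here is a minimal sketch of fitting such a tree with scikit-learn; the toy feature encoding and labels below are assumptions chosen purely for illustration, not data from this note.

    # A minimal sketch: training a decision tree classifier with scikit-learn
    # on a hypothetical toy data set.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Hypothetical encoding: temperature (0 = cold, 1 = hot), emotion (0 = sad, 1 = happy).
    X = [[0, 1], [1, 1], [1, 0], [0, 0], [0, 1], [1, 0]]
    y = ["stay home", "go out", "go out", "stay home", "stay home", "go out"]

    # criterion="gini" tells the tree to choose splits by Gini Impurity.
    clf = DecisionTreeClassifier(criterion="gini")
    clf.fit(X, y)

    print(clf.predict([[1, 1]]))  # classify a new instance
    print(export_text(clf, feature_names=["temperature", "emotion"]))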

Gini Impurity

Gini Impurity measures the likelihood that a new instance of data would be incorrectly classified if it were classified at random according to the distribution of class labels in the data set. It therefore helps us identify which feature is better suited to split, or test (in the same sense as in the "Decision Tree" section above), the data.

Formula

If we have C total classes and p(i) is the probability of picking a data point with class i, then the Gini Impurity (G) is calculated as:

G = \sum_{i=1}^{C} p(i)\,(1 - p(i)) = 1 - \sum_{i=1}^{C} p(i)^{2}
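
Translated directly into Python, the formula looks like the following minimal sketch (the function name and the label-list input format are illustrative choices, not from the original note):

    from collections import Counter

    def gini_impurity(labels):
        """G = 1 - sum_i p(i)^2, computed from a list of class labels."""
        counts = Counter(labels)
        total = len(labels)
        return 1.0 - sum((count / total) ** 2 for count in counts.values())

    # A pure node has impurity 0; an even two-class split has impurity 0.5.
    print(gini_impurity(["stay", "stay", "stay"]))      # 0.0
    print(gini_impurity(["stay", "go", "stay", "go"]))  # 0.5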

Example [1]

Let's take the sample data as given in the following image:

[Image: sample data with features Emotion and Temperature and target "stay at home"]

First, we calculate the weighted Gini split for the feature Emotion:

[Image: weighted Gini split calculation for Emotion]

Now, for Temperature:

[Image: weighted Gini split calculation for Temperature]

On the basis of the lower weighted Gini split, we can say that Temperature is a better feature compared to Emotion for predicting whether to stay at home or not.

References

  1. Gini Impurity Measure – a simple explanation using python