A Decision Tree is a tree built from a dataset, in which each non-leaf node represents a column (feature), each leaf node represents an outcome, i.e. a value from the "target column", and the outgoing edges of a node represent the possible values that the node's column can take.
ID3 stands for Iterative Dichotomizer 3 and was developed by Ross Quinlan [1]. It is called so because it iteratively dichotomizes (divides) the data at each step. It is a top-down greedy approach, meaning that the root is created first and that the algorithm optimizes locally at each step.
    ID3(Examples, Target_Attribute, Attributes)
        Create a root node Root for the tree
        If all examples are positive, return the single-node tree Root, with label = +
        If all examples are negative, return the single-node tree Root, with label = -
        If the set of predicting attributes is empty, return the single-node tree Root,
            with label = most common value of the target attribute in the examples
        Otherwise begin
            A ← the attribute with the maximum information gain
            Decision tree attribute for Root = A
            For each possible value vi of A:
                Add a new tree branch below Root, corresponding to the test A = vi
                Let Examples(vi) be the subset of examples that have the value vi for A
                If Examples(vi) is empty:
                    Below this new branch, add a leaf node with label = most common target value in the examples
                Else:
                    Below this new branch, add the subtree ID3(Examples(vi), Target_Attribute, Attributes − {A})
            End
        Return Root
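Before working through an example, here is a minimal Python sketch of the pseudocode above. The function names (`entropy`, `information_gain`, `id3`) and the use of plain dicts for the examples are my own choices for illustration, not part of the original description:

```python
from collections import Counter
from math import log2


def entropy(examples, target):
    """Shannon entropy of the target column over a list of example dicts."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(examples, attribute, target):
    """Entropy(S) minus the weighted entropy of each subset S_v of `attribute`."""
    total = len(examples)
    remainder = 0.0
    for v in set(ex[attribute] for ex in examples):
        subset = [ex for ex in examples if ex[attribute] == v]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - remainder


def id3(examples, target, attributes):
    """Build a decision tree as nested dicts: {attribute: {value: subtree_or_leaf}}."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:          # all examples share one label -> leaf
        return labels[0]
    if not attributes:                 # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the attribute with maximum information gain.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    # Note: we branch on the values seen in the data; the pseudocode above
    # branches on every possible value in the attribute's domain.
    for v in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == v]
        tree[best][v] = id3(subset, target, [a for a in attributes if a != best])
    return tree
```

With the table below loaded as a list of dicts, a hypothetical call like `id3(rows, "Target feature", ["Feature A", "Feature B", "Feature C"])` returns the tree as nested dictionaries.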
Let's look at the following table, in which all values are binary, and see how the decision tree would be structured.
Feature A | Feature B | Feature C | Target feature |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 1 | 1 | 1 |
1 | 1 | 0 | 0 |
1 | 0 | 1 | 1 |
1 | 1 | 1 | 1 |
0 | 1 | 0 | 0 |
1 | 0 | 1 | 1 |
1 | 0 | 1 | 1 |
0 | 1 | 1 | 1 |
1 | 1 | 0 | 0 |
1 | 1 | 0 | 1 |
0 | 1 | 0 | 0 |
0 | 1 | 1 | 1 |
0 | 1 | 1 | 0 |
Calculating the entropy of the target feature, we get:
Entropy(S) = -(8/14) * log₂(8/14) - (6/14) * log₂(6/14) = 0.99
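As a quick sanity check, the same number can be reproduced in a few lines of Python (the variable names are mine):

```python
from math import log2

# Target column from the table above: 8 ones and 6 zeros.
target = [0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0]
p1 = target.count(1) / len(target)   # 8/14
p0 = target.count(0) / len(target)   # 6/14
entropy_s = -p1 * log2(p1) - p0 * log2(p0)
print(round(entropy_s, 2))  # 0.99
```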
To compute IG(S, A), we split S on feature A. For v = 0, |Sᵥ| = 6 and
Entropy(Sᵥ) = -(2/6) * log₂(2/6) - (4/6) * log₂(4/6) = 0.91
Doing the same for v = 1 and subtracting the weighted entropies from Entropy(S) gives IG(S, A) = 0.13. Similarly,
IG(S, B) = 0.04
IG(S, C) = 0.4
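These gains can be checked numerically. The sketch below re-encodes the table as tuples of (A, B, C, target); the helper functions `entropy` and `information_gain` are my own names:

```python
from math import log2

# Rows of the table above: (A, B, C, target).
rows = [
    (0, 0, 0, 0), (1, 1, 1, 1), (1, 1, 0, 0), (1, 0, 1, 1),
    (1, 1, 1, 1), (0, 1, 0, 0), (1, 0, 1, 1), (1, 0, 1, 1),
    (0, 1, 1, 1), (1, 1, 0, 0), (1, 1, 0, 1), (0, 1, 0, 0),
    (0, 1, 1, 1), (0, 1, 1, 0),
]


def entropy(labels):
    total = len(labels)
    return -sum((labels.count(v) / total) * log2(labels.count(v) / total)
                for v in set(labels))


def information_gain(rows, col):
    labels = [r[-1] for r in rows]
    gain = entropy(labels)
    for v in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == v]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain


for name, col in (("A", 0), ("B", 1), ("C", 2)):
    print(name, round(information_gain(rows, col), 2))
# A 0.13, B 0.04, C 0.4
```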
Since Feature C has the highest information gain, it becomes the root. Next, we calculate the IG of both A and B on the set SC1 (the subset of S where C = 1).
IG(SC1, A) = 0.20
IG(SC1, B) = 0.09
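Restricting the same computation to the eight rows of the table where C = 1 reproduces these two values; the snippet below is again only an illustrative sketch:

```python
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((labels.count(v) / total) * log2(labels.count(v) / total)
                for v in set(labels))


# The eight rows of the table where C = 1, kept as (A, B, target).
sc1 = [(1, 1, 1), (1, 0, 1), (1, 1, 1), (1, 0, 1),
       (1, 0, 1), (0, 1, 1), (0, 1, 1), (0, 1, 0)]

base = entropy([t for _, _, t in sc1])
for name, col in (("A", 0), ("B", 1)):
    gain = base
    for v in (0, 1):
        subset = [row[2] for row in sc1 if row[col] == v]
        if subset:
            gain -= (len(subset) / len(sc1)) * entropy(subset)
    print(name, round(gain, 2))
# A 0.2, B 0.09
```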
Since IG(SC1, A) is higher, we place feature A on the right side of C (under the C = 1 branch). And since SC0 (the subset where C = 0) does not have zero entropy, we place the only remaining feature, B, on the left side of C (under the C = 0 branch). Our current tree looks like this:
Since we have exhausted all the features, we now define the leaf nodes based on the majority class: if more than half of the examples reaching a leaf are 0, we label the leaf 0; otherwise we label it 1.
This is what the finished tree looks like:
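In code form, the finished tree can be written out as nested dictionaries, with the leaf labels obtained from the majority rule above; this representation and the `predict` helper are my own sketch, not part of the original post:

```python
# The tree from the walkthrough, as nested dicts ({feature: {value: subtree_or_leaf}}).
tree = {
    "C": {
        1: {"A": {1: 1, 0: 1}},   # both A-branches end up labelled 1
        0: {"B": {1: 0, 0: 0}},   # both B-branches end up labelled 0
    }
}


def predict(tree, example):
    """Walk the nested-dict tree until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        feature = next(iter(tree))
        tree = tree[feature][example[feature]]
    return tree


print(predict(tree, {"A": 1, "B": 0, "C": 1}))  # 1
print(predict(tree, {"A": 0, "B": 1, "C": 0}))  # 0
```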
[1] Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81–106