CS460 Lecture note


ID3 Decision Tree

Dheemanth Reddy Regati (1711053)


Introduction


A Decision Tree is a tree built from a dataset, in which each non-leaf node represents a feature (a column), each leaf node represents an outcome, i.e. a value of the "target column", and the outgoing edges of a node represent the possible values that its feature can take.

ID3 stands for Iterative Dichotomizer 3 and was developed by Ross Quinlan [1]. It is called so because it iteratively dichotomizes (divides) the data at each iteration. It is a top-down, greedy approach, meaning that the root is created first and that the algorithm optimizes locally at each iteration.

Entropy and Information Gain


Entropy can be thought of as the amount of randomness in a particular set of data, while information gain is a measure of the reduction in entropy due to a feature, i.e. how well a feature separates the "target feature". Entropy is defined as:
Entropy(S) = - Σᵢ₌₁ⁿ pᵢ * log₂(pᵢ)

where pᵢ is the fraction of the dataset whose target feature/attribute has value i, and n is the number of possible values that the target feature/attribute can take.
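
As a quick illustration, this formula can be computed in a few lines of Python (the helper name entropy and the list-of-labels input are illustrative choices, not something prescribed by ID3 itself):

from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a collection of target values."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# 8 positive and 6 negative examples, as in the worked example later on:
print(round(entropy([1] * 8 + [0] * 6), 2))  # 0.99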

Information Gain is defined as:
IG(S, A) = Entropy(S) - Σᵥ (|Sᵥ| / |S|) * Entropy(Sᵥ)

where the sum runs over every value v of the feature/attribute A, Sᵥ is the subset of the dataset where feature A has value v, and |S| is the number of entries in the dataset.
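
A matching sketch for information gain, assuming the dataset is given as a list of dicts mapping column names to values (that representation is only an assumption for illustration); the entropy helper from the previous sketch is repeated so the snippet runs on its own:

from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def information_gain(rows, attribute, target):
    # IG(S, A): entropy of the target minus the weighted entropy of each
    # subset S_v obtained by splitting the rows on the given attribute.
    gain = entropy([row[target] for row in rows])
    for v in set(row[attribute] for row in rows):
        subset = [row[target] for row in rows if row[attribute] == v]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain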

Pseudocode/Algorithm


ID3(Examples, Target_Attribute, Attributes)
    Create a root node Root for the tree
    If all examples are positive, return the single-node tree Root with label = +
    If all examples are negative, return the single-node tree Root with label = -
    If the set of predicting attributes is empty, return the single-node tree Root,
        with label = the most common value of the target attribute in the examples
    Otherwise Begin
        A ← the attribute with the maximum information gain
        Decision tree attribute for Root = A
        For each possible value vi of A:
            Add a new tree branch below Root, corresponding to the test A = vi
            Let Examples(vi) be the subset of examples that have value vi for A
            If Examples(vi) is empty:
                Below this new branch add a leaf node with label = the most common target value in the examples
            Else:
                Below this new branch add the subtree ID3(Examples(vi), Target_Attribute, Attributes − {A})
    End
    Return Root
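
A minimal runnable Python sketch of this pseudocode, under the same assumptions as before (dataset as a list of dicts, tree returned as nested dicts keyed by attribute and then by value, leaves as plain labels); the entropy and information-gain helpers are repeated so the snippet is self-contained:

from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def information_gain(rows, attribute, target):
    gain = entropy([row[target] for row in rows])
    for v in set(row[attribute] for row in rows):
        subset = [row[target] for row in rows if row[attribute] == v]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

def id3(rows, target, attributes):
    labels = [row[target] for row in rows]
    # All examples share one label: return a leaf with that label.
    if len(set(labels)) == 1:
        return labels[0]
    # No predicting attributes left: return a leaf with the most common label.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # A <- the attribute with maximum information gain.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    tree = {best: {}}
    for v in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == v]
        tree[best][v] = id3(subset, target, [a for a in attributes if a != best])
    return tree

Note that this sketch branches only on values of the chosen attribute that actually occur in the current rows, so the recursive call never receives an empty Examples(vi); the pseudocode instead enumerates every possible value and covers the empty case with a majority-label leaf.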
        

Example case

Let's look at the following table, in which all values are binary, and see how the decision tree would be structured.

Feature A   Feature B   Feature C   Target feature
    0           0           0              0
    1           1           1              1
    1           1           0              0
    1           0           1              1
    1           1           1              1
    0           1           0              0
    1           0           1              1
    1           0           1              1
    0           1           1              1
    1           1           0              0
    1           1           0              1
    0           1           0              0
    0           1           1              1
    0           1           1              0

Solution:

We know that |S| = 14. Now,

Calculating the entropy of the target feature, we get:
Entropy(S) = - (8/14) * log₂(8/14) - (6/14) * log₂(6/14) = 0.99

Now, Information Gain of Feature A would be:
For v = 1, |Sᵥ| = 8
Entropy(Sᵥ) = - (6/8) * log₂(6/8) - (2/8) * log₂(2/8) = 0.81

For v =0, |Sᵥ| = 6
Entropy(Sᵥ) = - (2/6) * log₂(2/6) - (4/6) * log₂(4/6) = 0.91

⟹ IG(S, A) = Entropy(S) - (|S₁| / |S|) * Entropy(S₁) - (|S₀| / |S|) * Entropy(S₀)
∴ IG(S, A) = 0.99 - (8/14) * 0.81 - (6/14) * 0.91 = 0.13

Similarly,
IG(S, B) = 0.04
IG(S, C) = 0.4

Since Feature C has the highest information gain, it will be the root. Now, we calculate the IG of both 'A' and 'B' on the set SC1 (the subset of S where C = 1).

IG(SC1,A)=0.20
IG(SC1,B)=0.09
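
These numbers can be reproduced with a short script; the tuple encoding of the table and the helper names below are illustrative choices:

from math import log2

# The 14 rows of the table above as (A, B, C, Target) tuples.
rows = [
    (0, 0, 0, 0), (1, 1, 1, 1), (1, 1, 0, 0), (1, 0, 1, 1), (1, 1, 1, 1),
    (0, 1, 0, 0), (1, 0, 1, 1), (1, 0, 1, 1), (0, 1, 1, 1), (1, 1, 0, 0),
    (1, 1, 0, 1), (0, 1, 0, 0), (0, 1, 1, 1), (0, 1, 1, 0),
]

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(v) / n) * log2(labels.count(v) / n) for v in set(labels))

def info_gain(rows, col):
    gain = entropy([r[-1] for r in rows])
    for v in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == v]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

print(round(entropy([r[-1] for r in rows]), 2))                  # 0.99
print([round(info_gain(rows, c), 2) for c in (0, 1, 2)])         # [0.13, 0.04, 0.4]

sc1 = [r for r in rows if r[2] == 1]                             # subset where C = 1
print(round(info_gain(sc1, 0), 2), round(info_gain(sc1, 1), 2))  # 0.2 0.09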

Since A has the higher information gain on SC1, we place feature A on the right side of C (the C = 1 branch). And since SC0 (the subset where C = 0) does not have zero entropy, we place the only remaining feature, B, on the left side of C (the C = 0 branch). Our current tree looks like this:

Image of the tree currently built

Since we have exhausted all the features, we now assign labels to the leaf nodes by majority, i.e. if more than half of the examples reaching a leaf have target value 0, we label that leaf 0; otherwise we label it 1.
This is how the finished tree looks:

The final decision tree obtained
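
For reference, in the nested-dict notation used by the id3 sketch earlier, and assuming the leaves follow the majority rule just described, the final tree could be written as:

final_tree = {
    "C": {
        0: {"B": {0: 0, 1: 0}},  # C = 0 branch: split on B; both leaves get label 0 by majority
        1: {"A": {0: 1, 1: 1}},  # C = 1 branch: split on A; both leaves get label 1 by majority
    }
}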

Advantages and Disadvantages of ID3:

Advantages

- Simple to implement, and the resulting tree is easy to interpret.
- Greedily picking the most informative attribute tends to produce short trees quickly.
- Works directly on categorical data, with no need for feature scaling.

Disadvantages

- Being greedy, it optimizes locally and is not guaranteed to find the globally optimal (smallest) tree.
- Prone to overfitting, since the basic algorithm has no pruning step.
- Handles only discrete attributes in its original form; continuous features must be discretized first.
- Information gain is biased towards attributes with many distinct values.

References

[1] Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81–106