A Decision Tree is a tree built from a dataset, in which each non-leaf node represents a column (feature), each leaf node represents an outcome, i.e. a value from the "target column", and the outgoing edges of a node represent the possible values that the node's column can take.
ID3 stands for Iterative Dichotomizer 3 and was developed by Ross Quinlan [1]. It is called so because it iteratively dichotomizes (divides) the data at each step. It is a top-down greedy approach, meaning that the root is created first and that the algorithm optimizes locally at each step.
    ID3(Examples, Target_Attribute, Attributes)
        Create a root node Root for the tree
        If all examples are positive, return the single-node tree Root, with label = +
        If all examples are negative, return the single-node tree Root, with label = -
        If the set of predicting attributes is empty, return the single-node tree Root,
            with label = most common value of the target attribute in the examples
        Otherwise begin
            A ← the attribute with the maximum information gain
            Decision tree attribute for Root = A
            For each possible value vi of A:
                Add a new tree branch below Root, corresponding to the test A = vi
                Let Examples(vi) be the subset of examples that have the value vi for A
                If Examples(vi) is empty:
                    Below this new branch, add a leaf node with label = most common target value in the examples
                Else:
                    Below this new branch, add the subtree ID3(Examples(vi), Target_Attribute, Attributes − {A})
            End
        Return Root
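Before working through an example, here is a minimal Python sketch of the pseudocode above. The function names (`entropy`, `information_gain`, `id3`) and the use of plain dicts for the examples are my own choices for illustration, not part of the original description:

```python
from collections import Counter
from math import log2


def entropy(examples, target):
    """Shannon entropy of the target column over a list of example dicts."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(examples, attribute, target):
    """Entropy(S) minus the weighted entropy of each subset S_v of `attribute`."""
    total = len(examples)
    remainder = 0.0
    for v in set(ex[attribute] for ex in examples):
        subset = [ex for ex in examples if ex[attribute] == v]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - remainder


def id3(examples, target, attributes):
    """Build a decision tree as nested dicts: {attribute: {value: subtree_or_leaf}}."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:          # all examples share one label -> leaf
        return labels[0]
    if not attributes:                 # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the attribute with maximum information gain.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    # Note: we branch on the values seen in the data; the pseudocode above
    # branches on every possible value in the attribute's domain.
    for v in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == v]
        tree[best][v] = id3(subset, target, [a for a in attributes if a != best])
    return tree
```

With the table below loaded as a list of dicts, a hypothetical call like `id3(rows, "Target feature", ["Feature A", "Feature B", "Feature C"])` returns the tree as nested dictionaries.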
Let's look at the following table, in which all values are binary, and see how the decision tree would be structured.
Feature A | Feature B | Feature C | Target feature |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 1 | 1 | 1 |
1 | 1 | 0 | 0 |
1 | 0 | 1 | 1 |
1 | 1 | 1 | 1 |
0 | 1 | 0 | 0 |
1 | 0 | 1 | 1 |
1 | 0 | 1 | 1 |
0 | 1 | 1 | 1 |
1 | 1 | 0 | 0 |
1 | 1 | 0 | 1 |
0 | 1 | 0 | 0 |
0 | 1 | 1 | 1 |
0 | 1 | 1 | 0 |
Calculating the entropy of the target feature, we get:
Entropy(S) = -(8/14) * log₂(8/14) - (6/14) * log₂(6/14) = 0.99
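As a quick sanity check, the same number can be reproduced in a few lines of Python (the variable names are mine):

```python
from math import log2

# Target column from the table above: 8 ones and 6 zeros.
target = [0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0]
p1 = target.count(1) / len(target)   # 8/14
p0 = target.count(0) / len(target)   # 6/14
entropy_s = -p1 * log2(p1) - p0 * log2(p0)
print(round(entropy_s, 2))  # 0.99
```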
To compute IG(S, A), we split S on feature A. For v = 0, |Sᵥ| = 6 and
Entropy(Sᵥ) = -(2/6) * log₂(2/6) - (4/6) * log₂(4/6) = 0.91
Doing the same for v = 1 and subtracting the weighted entropies from Entropy(S) gives IG(S, A) = 0.13. Similarly,
IG(S, B) = 0.04
IG(S, C) = 0.4
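These gains can be checked numerically. The sketch below re-encodes the table as tuples of (A, B, C, target); the helper functions `entropy` and `information_gain` are my own names:

```python
from math import log2

# Rows of the table above: (A, B, C, target).
rows = [
    (0, 0, 0, 0), (1, 1, 1, 1), (1, 1, 0, 0), (1, 0, 1, 1),
    (1, 1, 1, 1), (0, 1, 0, 0), (1, 0, 1, 1), (1, 0, 1, 1),
    (0, 1, 1, 1), (1, 1, 0, 0), (1, 1, 0, 1), (0, 1, 0, 0),
    (0, 1, 1, 1), (0, 1, 1, 0),
]


def entropy(labels):
    total = len(labels)
    return -sum((labels.count(v) / total) * log2(labels.count(v) / total)
                for v in set(labels))


def information_gain(rows, col):
    labels = [r[-1] for r in rows]
    gain = entropy(labels)
    for v in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == v]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain


for name, col in (("A", 0), ("B", 1), ("C", 2)):
    print(name, round(information_gain(rows, col), 2))
# A 0.13, B 0.04, C 0.4
```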
Since Feature C has the highest information gain, it becomes the root. Next, we calculate the IG of both A and B on the set SC1 (the subset of S where C = 1).
IG(SC1, A) = 0.20
IG(SC1, B) = 0.09
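Restricting the same computation to the eight rows of the table where C = 1 reproduces these two values; the snippet below is again only an illustrative sketch:

```python
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((labels.count(v) / total) * log2(labels.count(v) / total)
                for v in set(labels))


# The eight rows of the table where C = 1, kept as (A, B, target).
sc1 = [(1, 1, 1), (1, 0, 1), (1, 1, 1), (1, 0, 1),
       (1, 0, 1), (0, 1, 1), (0, 1, 1), (0, 1, 0)]

base = entropy([t for _, _, t in sc1])
for name, col in (("A", 0), ("B", 1)):
    gain = base
    for v in (0, 1):
        subset = [row[2] for row in sc1 if row[col] == v]
        if subset:
            gain -= (len(subset) / len(sc1)) * entropy(subset)
    print(name, round(gain, 2))
# A 0.2, B 0.09
```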
Since IG(SC1, A) is higher, we place feature A on the right side of C (under the C = 1 branch). And since SC0 (the subset where C = 0) does not have zero entropy, we place the only remaining feature, B, on the left side of C (under the C = 0 branch). Our current tree looks like this:
Since we have exhausted all the features, we now define the leaf nodes based on the majority class: if more than half of the examples reaching a leaf are 0, we label the leaf 0; otherwise we label it 1.
This is what the finished tree looks like:
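In code form, the finished tree can be written out as nested dictionaries, with the leaf labels obtained from the majority rule above; this representation and the `predict` helper are my own sketch, not part of the original post:

```python
# The tree from the walkthrough, as nested dicts ({feature: {value: subtree_or_leaf}}).
tree = {
    "C": {
        1: {"A": {1: 1, 0: 1}},   # both A-branches end up labelled 1
        0: {"B": {1: 0, 0: 0}},   # both B-branches end up labelled 0
    }
}


def predict(tree, example):
    """Walk the nested-dict tree until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        feature = next(iter(tree))
        tree = tree[feature][example[feature]]
    return tree


print(predict(tree, {"A": 1, "B": 0, "C": 1}))  # 1
print(predict(tree, {"A": 0, "B": 1, "C": 0}))  # 0
```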
[1] Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81–106