Entropy and Information Gain
Continued from Part 1.
Let’s start with some definitions.
Tree: A hierarchical structure of nodes and the connections between those nodes (branches), with parent-child relationships. Child nodes have parent nodes, which in turn may have their own parent nodes. The highest node is the root node.
Decision tree: a flow-chart-like structure where each internal (non-leaf) node denotes a test on an attribute, and each branch represents the outcome of a test. Each leaf (or terminal) node holds a class label.
Computer model: a computer program designed to simulate what did happen, or what could happen, in a situation.
We made a model in Part 1, but it wasn’t a computer model. You can think of a model as a simple or smaller example of something larger or more complex. Models may be based on real objects or on theories.
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. Tree models where the target variable can take a discrete set of values are called classification trees.
Information Gain: a metric used to measure homogeneity. It is used to choose the attribute on which to split the data at a node in a decision tree: the chosen attribute is the one whose split yields the highest purity among the daughter nodes.
Information entropy: the average rate at which information is produced by a randomly determined source of data. The measure of heterogeneity in a collection. The lower the entropy, the higher the purity (homogeneity).
Constructing a decision tree involves finding the attribute that returns the highest information gain (i.e., the most homogeneous branches) and repeating this until the tree is complete. The topmost node in a tree is the root node. The first split is done at the root, and it’s the split with the highest information gain.
An example of a randomly-determined (stochastic) source of data is a die roll. Entropy measures the average information content you can expect from observing an outcome, so rolling a die has more entropy than tossing a coin because each of the die's six outcomes has a smaller probability than each of the coin's two outcomes.
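This comparison is easy to check numerically. Here is a minimal sketch (the helper name `entropy` and the probability lists are our own illustration, not from the text):

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

coin = [1/2] * 2   # fair coin: two equally likely outcomes
die  = [1/6] * 6   # fair six-sided die: six equally likely outcomes

print(entropy(coin))  # 1.0 bit
print(entropy(die))   # about 2.585 bits: more outcomes, more entropy
```

For a uniform distribution over n outcomes the entropy is log2(n), which is why the die (log2(6) ≈ 2.585 bits) beats the coin (log2(2) = 1 bit).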
Let’s look at our distinctive feature graph. Notice that the /ɑ/ (“father a”) has three distinctive features: [+syllabic +back +low].
Given a phoneme, we can use a decision tree to figure out which of two classes it belongs to, /ɑ/ or not /ɑ/ (¬/ɑ/), using our three given features.
You can traverse the tree downward, answering each hypothesis with a yes or no until you arrive at the appropriate class of /ɑ/ or ¬/ɑ/.
Imagine you knew only the features and not the phoneme (class). Given the feature matrix [+syllabic +back +high], what is the outcome? If you traverse the tree, you see that the criterion for +syllabic? is true, or “yes”, so you proceed down the “yes” branch. You then see that +back? is true, so you proceed down the “yes” branch. However, you see that +low? is false, so you proceed down the “no” branch and find that the feature matrix represents the ¬/ɑ/ class. We’ve just used a decision tree to classify a phoneme!
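The traversal can be written out as a few nested tests. This is only a sketch: the dict-of-booleans feature representation and the function name `classify` are our own illustration, not from the text.

```python
def classify(features):
    """Classify a phoneme as 'ɑ' or 'not ɑ' from three distinctive features.

    `features` is a dict such as {"syllabic": True, "back": True, "low": True},
    mirroring the feature matrix [+syllabic +back +low].
    """
    if not features["syllabic"]:   # -syllabic: take the "no" branch
        return "not ɑ"
    if not features["back"]:       # -back: take the "no" branch
        return "not ɑ"
    if not features["low"]:        # -low (e.g. a [+high] vowel): "no" branch
        return "not ɑ"
    return "ɑ"                     # [+syllabic +back +low] reaches the /ɑ/ leaf

print(classify({"syllabic": True, "back": True, "low": True}))   # ɑ
print(classify({"syllabic": True, "back": True, "low": False}))  # not ɑ
```

Each `if` corresponds to one internal node of the tree, and each `return` is a leaf holding a class label.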
Side note: we’re just using two classes, so this tree is a binary classifier.
Let’s use the equations above to calculate the splits. Information gain is used to decide which feature to split on at each step in building the tree. At each step we choose the split that results in the purest daughter nodes. Every available split must be tested to find which gives the purest result.
Entropy is defined below:
entropy = -p1·log2(p1) - p2·log2(p2) - . . . = -Σi pi·log2(pi)
Where p1, p2, . . . are the fractions of examples in class i. They add up to 1 and represent the percentage of each class present in the child node that results from a split in the tree.
Let’s use the following data to calculate entropy:
There are actually 3 features (we’re not counting “class” as a feature, since it’s the answer), so J = 3.
To get the entropy of the “height” node, get the counts of the +high and the counts of +low.
Divide both by the count of all observations. Since there are 4 observations (rows), we divide each by 4:
P+high = 2/4 = 0.5
P+low= 2/4 = 0.5
We now have to use the logarithm. A calculator will tell you that log2(0.5) = -1, so -(0.5)·log2(0.5) - (0.5)·log2(0.5) = 0.5 + 0.5 = 1.
Notice we have two class labels, so we only have to take the log of two values. Also notice that with 2 classes, an even split (0.5 each) gives us the maximal entropy of 1.
We now know that the entropy of the “height” node is 1.
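The same calculation can be done directly from a column of values. As a sketch (the helper name `entropy` and the list of height values are our own, standing in for the table above):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    n = len(values)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(values).values())

# The "height" column: two +high rows and two +low rows out of four.
height = ["+high", "+high", "+low", "+low"]
print(entropy(height))  # 1.0, the maximal entropy for an even two-way split
```

`Counter` does the counting step for us (2 of each value), and dividing each count by 4 gives the same P+high = P+low = 0.5 as above.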
For each example in the “height” column, we’ll look at the corresponding syllabic value in that row. Our split has just two branches because there are just two values for the “height” feature: +high and +low.
*When calculating entropy, the p = 1 and p = 0 terms actually contribute 0: every example in such a branch has the same value, making it totally pure, so there is no entropy.
We don’t need to calculate the entropy of the -syllabic branch because we already know it’s 0. We just need to calculate the entropy for P+high and P+low for the +syllabic branch.
entropy = -(2/3)·log2(2/3) - (1/3)·log2(1/3) ≈ 0.9184
The entropy of the children is the weighted average of the two branches.
entropy(children) = (3/4)(0.9184) + (1/4)(0) = 0.6888
We’ll split our data using information gain. It’s the entropy of the parent minus the weighted sum of the entropy of its children:
information gain = entropy(parent) - Σ (n_child / n_parent)·entropy(child)
The split will try to maximize the information gain. Let’s continue by using our data above to get the information gain from the “syllabic” feature. Let’s assume that the parent is “height”, which has an entropy of 1.
syllabic information gain = 1 - [(3/4)(0.9184) + (1/4)(0)] = 1 - 0.6888 = 0.3112
We subtracted the entropy of the children from the entropy of the parent to get the information gain for the syllabic split. We have to compare this to the backness split, so we have to use the same method above to calculate the backness information gain.
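The whole calculation can be bundled into one function. This is a sketch using the worked numbers from the text (an even two-class parent with entropy 1, a three-example +syllabic branch split 2/1 between the classes, and a pure one-example -syllabic branch); the class labels "a" and "not-a" are our own placeholders for /ɑ/ and ¬/ɑ/:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    n = len(values)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(values).values())

def information_gain(parent, branches):
    """Entropy of the parent minus the weighted entropy of its child branches."""
    n = len(parent)
    weighted_children = sum((len(branch) / n) * entropy(branch)
                            for branch in branches)
    return entropy(parent) - weighted_children

parent = ["a", "a", "not-a", "not-a"]   # entropy = 1
plus_syllabic = ["a", "a", "not-a"]     # entropy ≈ 0.9184
minus_syllabic = ["not-a"]              # pure branch, entropy = 0
print(information_gain(parent, [plus_syllabic, minus_syllabic]))  # ≈ 0.3113
```

To pick a split, you would call `information_gain` once per candidate feature (syllabic, backness, height) on the same parent and keep the feature with the largest result.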
We repeat this process, recursively comparing information gain. A loss function determines which split is better. Once these features are used for a split, they are no longer candidates for further splits.