Decision Trees for Linguists Pt. 1 – A super simple example

By | September 1, 2017

Continued in Part 2.


If I have a basket of apples, and only apples, then it’s considered “pure.” If I put a banana in the basket, I can no longer call it a basket of apples, and it’s now considered “impure.” Purity is a measure of how varied a set is. A purity of 1 means that it’s totally pure. If I keep adding bananas until the number of apples and bananas is the same, the purity drops to 50%, or 0.5. If I continue adding bananas, the purity goes back up, but this time it’s because there’s a lot of bananas rather than a lot of apples. In other words, if there are only two classes, the lowest possible purity value is 0.5. The purity never goes below 1/2 because the least pure the basket can be is half bananas and half apples.
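This idea can be sketched in a few lines of Python. Here purity is simply the fraction of the basket taken up by its most common fruit; this is a simplified measure for illustration, not the only way to define purity:

```python
from collections import Counter

def purity(basket):
    """Fraction of the basket occupied by its most common class."""
    counts = Counter(basket)
    return max(counts.values()) / len(basket)

print(purity(["apple"] * 8))                   # 1.0: all apples, totally pure
print(purity(["apple"] * 4 + ["banana"] * 4))  # 0.5: least pure for two classes
print(purity(["apple"] * 2 + ["banana"] * 6))  # 0.75: mostly bananas, purity climbs again
```

Notice that once bananas outnumber apples, the purity climbs back up, just as described above.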


We can also look at the purity of features. For example, color is a feature. If my basket has a bunch of red objects and some yellow objects mixed in, then I know the basket contents aren’t pure. We can identify objects by their features. E.g., a banana is usually yellow and an apple is usually red. We can then record the color of each object in my basket and assign a 1 for yellow and a 0 for not yellow. Consequently, we assign a 1 to all the bananas’ color feature and a 0 to all the apples’. Color may be all we need to make a distinction between apples and bananas. This separation is called “classification,” which is what decision trees do well.

We can document other features as well. Say we had a basket of 11 pieces of fruit — all either apples or bananas. Color is great, but what if we encounter a green apple or a green banana? We can’t use the “yellow” feature alone. We need more features. We do this by recording other properties that each piece of fruit has. These properties, or features, are descriptors that we think are useful in determining what kind of fruit it is. The process of getting these properties is called “feature extraction.” We also record its class (apple or banana). This gives us a row of properties and a class for each piece of fruit. Since there are 11 pieces, we have 11 rows in our dataset.
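As a sketch of what feature extraction might look like in code, each piece of fruit could be turned into a row of binary features plus its class. The field names and the 5-inch threshold here are illustrative choices, not a fixed recipe:

```python
# Hypothetical feature-extraction step: turn raw measurements into a row
# of binary features (1 = yes, 0 = no) plus the class label.
def extract_features(measurements):
    return {
        "round":     1 if measurements["shape"] == "round" else 0,
        "yellow":    1 if measurements["color"] == "yellow" else 0,
        ">5 inches": 1 if measurements["length_in"] > 5 else 0,
        "class":     measurements["kind"],
    }

row = extract_features({"shape": "round", "color": "red", "length_in": 3, "kind": "apple"})
print(row)  # {'round': 1, 'yellow': 0, '>5 inches': 0, 'class': 'apple'}
```

Running this once per piece of fruit gives us one row per observation, 11 rows in all.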

Creating the data

We record our observations for each piece of fruit, writing down the class (apple or banana), its roundness, its color, and its length. We may have also unintentionally recorded some useless features, like sweetness or whether it grows on a tree. Getting only the useful features is called “feature selection.” This is often just a subset of available features. Even though decision trees do a good job of selecting which features are important as they’re being trained, we’ll just take out the “sweetness” and “grows on tree” features.

On the left is our entire dataset of 11 observations (rows), three features, and the class (label). “>5 inches” means “length is greater than 5 inches.” Technically, it has four features (the class is really a feature as well), but we usually don’t refer to the class as a feature. Instead, it’s often called the “expected outcome” or, of course, the “class.” Each time the model fails to reach this outcome, it’s considered an “error.” This happens with outliers, like a yellow apple, or a model that is either too general or too specific. I won’t go into these details, but common sense tells us that a good model will be trained using more than 11 examples (pieces of fruit) and have many more features. E.g., hardness and weight may add to our accuracy. “Accuracy” is the rate of correct classifications; its complement is the error rate, often just called “error,” which we want to keep fairly low.
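Since the actual table lives in the figure, here is a hypothetical stand-in with the same shape: 11 rows, three binary features, and a class. The individual values are invented for illustration; only the overall structure follows the description above:

```python
# Hypothetical stand-in for the figure's table. Each tuple is
# (round, yellow, >5 inches, class), with 1 = yes and 0 = no.
dataset = [
    (1, 0, 0, "apple"),
    (1, 0, 0, "apple"),
    (1, 0, 0, "apple"),
    (1, 0, 1, "apple"),   # one unusually long apple
    (1, 1, 0, "apple"),   # two yellow apples
    (1, 1, 0, "apple"),
    (0, 1, 1, "banana"),
    (0, 1, 1, "banana"),
    (0, 1, 1, "banana"),
    (0, 1, 1, "banana"),
    (0, 1, 1, "banana"),
]
print(len(dataset))  # 11 observations
```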

Training the model

Not only did we make the dataset ourselves, but we’ll make the decision tree ourselves. The class is what we’re trying to guess as we train our decision tree. Training involves making guesses and seeing which ones are best. With decision trees, we find the best one, then the second-best, then the third, etc. This makes them hierarchical (hence the name “decision tree”). We make guesses by looking at all the available features. We find out how successful each one is at determining the class and compare that to how successful the other features are.

“Supervised learning” basically means we train a model on data where the expected outcome (the class) is already known, so people supply the answers the model learns from. We decide which feature is the best at predicting the outcome feature (class) by comparing how pure each one is at each node in the tree.

Let’s look at the first feature on the table on the left: roundness. If the item is round, then it is assigned a “yes”; otherwise, a “no”. Since apples are usually fairly round, this value is usually a “yes” for apples; since bananas are almost never round, this value is almost always a “no” for bananas. In computer science and math, we often represent a “yes” or “true” as a 1 and a “no” or “false” as a 0.

On the left, you can see every apple in our sample was recorded as being round, and no banana was recorded as being round. This makes the ‘round’ variable pure! Why is this important? It means we can use this feature to fairly reliably sort the fruit in our basket into a pile of bananas and a pile of apples.

But what if we encounter a new basket of apples and bananas? Our training data came only from the first basket. This new basket happens to have an apple that isn’t very round. In this case, we must rely on the other features, even though they aren’t as good at predicting the kind (class) of fruit. Let’s look at what we have in our training data from the first basket:

On the left, you can see all of the bananas in our basket were yellow, but two of the apples were also yellow. This means that, if we go by the color feature alone, we will sometimes misclassify an apple or a banana. However, this won’t happen too often, and if we guess the class using this feature alone, it will still perform above chance.

Keep in mind that the two observations where the apples were recorded as being yellow aren’t errors: there actually were two yellow apples in our dataset. In very large datasets, features are rarely pure, and it’s not so obvious which ones are more important unless we use statistics, which we’ll do soon.


The length feature on the left has only one impure observation where an apple was recorded as being over 5 inches in length (that’s a huge apple!). Consequently, this ‘>5inches’ feature is considered more pure than the ‘yellow’ feature.

Purity can be measured, and the algorithm will use this measure to separate the apples from the bananas. A decision tree makes decisions, and we must give it those decisions based on how pure each feature is. The higher a feature’s purity, the more important it is. Since we already know that ‘round’ is a very important feature (how many bananas are round?), we’ll use it for the first decision.


Creating branches

The first decision will be: “Is this piece of fruit round?” If the answer is yes, it will put the piece of fruit into the “round fruit” pile. If it’s not, it will put the piece of fruit into a “non round” pile. Notice that we asked a “yes or no” question*. We only use features once, so we can basically ignore the ‘round’ feature from now on. Assuming we may encounter a non-round apple some day, we need other features to help us out. We’ll look at the second-most important feature, which is ‘>5inches’. This will separate each pile into two piles (creating 4 piles) based on whether or not the pieces of fruit are longer than 5 inches. The pile that creates the biggest group is where the split will occur.
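The first decision amounts to partitioning the rows into two piles. A minimal sketch, using invented rows where each tuple is (round, yellow, >5 inches, class):

```python
# The first decision as a partition: every row goes into a "round" pile
# or a "non-round" pile, based on the first feature.
rows = [
    (1, 0, 0, "apple"), (1, 1, 0, "apple"),
    (0, 1, 1, "banana"), (0, 1, 1, "banana"),
]
round_pile     = [r for r in rows if r[0] == 1]
non_round_pile = [r for r in rows if r[0] == 0]
print(len(round_pile), len(non_round_pile))  # 2 2
```

Each later split repeats this same partitioning step on one of the existing piles.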

Let’s look at how this works visually:

The “yes” branch is terminal, which means it’s the last branch. However, the “no” branch still has a decision to make: which feature is more pure between color and length. We’ll use the simple equation below:

9/11 is more pure than 6/11. We can use more complex equations later on, but a simple fraction will do for now, where the highest fraction wins. We’ll do this in our heads, but a max function would normally be used, which gets the highest value of a set. A higher fraction means the feature is more important, so that’s where we’ll make the next split (branch) in our tree. We keep repeating these comparisons as we create new splits. Splits are basically nodes that ask a question that can be answered with a “yes” or “no.” This entire process is called an “algorithm.” An algorithm is the method we use to reach our model’s goal. A “model” is the finished product, such as a decision tree. Here’s the finished tree structure:
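Here is a sketch of that comparison with invented data: each feature gets a fraction (how many observations it classifies correctly on its own), and a max function picks the winner. The rows and the mapping from feature value to class are assumptions for illustration; each tuple is (round, yellow, >5 inches, class):

```python
# Score a single binary feature by the fraction of rows it classifies
# correctly on its own, then pick the purest one with max().
def feature_score(rows, index, class_if_1, class_if_0):
    correct = sum(
        1 for row in rows
        if (class_if_1 if row[index] == 1 else class_if_0) == row[-1]
    )
    return correct / len(rows)

rows = [
    (1, 0, 0, "apple"), (1, 0, 0, "apple"), (1, 1, 0, "apple"),
    (0, 1, 1, "banana"), (0, 1, 1, "banana"), (0, 1, 1, "banana"),
]
scores = {
    "yellow":    feature_score(rows, 1, "banana", "apple"),
    ">5 inches": feature_score(rows, 2, "banana", "apple"),
}
best = max(scores, key=scores.get)  # the max function picks the purest feature
print(scores, best)
```

With these invented rows, the one yellow apple costs the color feature a correct answer, so length wins the split.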

Voila! There’s our decision tree. Unlike other models used for classification, you can actually see the structure of a decision tree. Generally, the lower the nodes are in the tree, the less important they are. In fact, sometimes they do more harm than good when classifying, which we’ll get to later. 
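Written out as code, the finished tree is just nested yes/no questions. The split order follows the text (round first, then length on the “no” branch); the label on the final leaf is an assumption, since the figure itself isn’t reproduced here:

```python
# The finished tree as nested yes/no questions. The leaf label on the
# lowest branch is an assumption; the article's figure holds the
# authoritative tree structure.
def classify(is_round, over_5_inches):
    if is_round:        # first, most important split
        return "apple"
    if over_5_inches:   # second split, on the "no" branch
        return "banana"
    return "apple"      # assumed leaf: short, non-round fruit

print(classify(True, False))   # apple
print(classify(False, True))   # banana
```

You can read the model’s structure directly off the if/else nesting, which is exactly the transparency this section describes.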

Testing our model

Let’s test our tree with some new data to see how well it classifies. Imagine that we get a new basket of, say, 9 pieces of fruit and measure their roundness, color, and length. We then run these “yes” or “no” results through our tree. The top split is roundness, so we check to see if the first observation is round. There’s a “yes” in that field. This branch ends at “apple,” so the model concludes that the first piece of fruit is an apple. We check this against the class in our test data. It says “apple,” which is the expected outcome. So far, our model is working perfectly!

Normally, the test data is smaller than the training data because we just need a rough idea of the model’s performance. This is known as “evaluation,” and it is measured by the error rate: the fraction of predictions that differ from the expected outcome. Each misclassification, such as when a banana is classified as an apple, increases the model’s error rate, often just called “error.” We can use basic counting to calculate this, but usually more complex measures are used, such as Mean Squared Error. So far our error rate is 0.
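Computing the error rate really is just counting; a minimal sketch with invented predictions:

```python
# Error rate: the fraction of test items the model classifies incorrectly.
def error_rate(predictions, expected):
    wrong = sum(1 for p, e in zip(predictions, expected) if p != e)
    return wrong / len(expected)

predicted = ["apple", "banana", "apple", "banana"]
actual    = ["apple", "banana", "banana", "banana"]
print(error_rate(predicted, actual))  # 0.25: one misclassification out of four
```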

With our test data (left), we know the expected outcome (class), just as we do in training data. However, the algorithm ignores the class during testing; if it used the class, it would be cheating, like having the answers to an exam. It turns out there are no surprises in the dataset above: all apples are round and red; all bananas are yellow and long. The model ends up with perfect classification: an error of 0. Our test got an error of 0, but the real world is noisy and has more complex kinds of variables. For example, some apples may be green, and some bananas may be less than 5 inches long. In statistics, we refer to this as “variation.” Such variation will increase the model’s error, but that’s fine; even humans make mistakes.

Improving our performance

When getting our data, we want to be as random as possible. This helps keep our model general. We also want to train the model with enough features to make it accurate on the training data, but not too many features.

You can improve the model’s performance in a variety of ways. This may include getting more training and testing data, using more or fewer features, using different methods to split branches, using different features, etc. Try adding and removing features to see if you get lower error.

Extra stuff that you don’t need to know

We’ll next look at a better way to calculate where to split the training data using a measure called “information gain,” which is more complicated than a mere ratio. We’ll also look at a more complicated model that uses many trees, called a “random forest.” Not only does it inherently reduce overfitting, but it can also output the most important features used to deduce the class, in addition to the class itself. This can help you reduce features, which is known as “feature selection.” Decision trees do this so well that they’re often used alongside other kinds of models when deciding which features to use.

*A yes-or-no question is similar to a hypothesis. Instead of asking, “Is roundness more important than length?” we can make the assertion that “roundness is more important than length” and then test it mathematically. We use a function to test our hypothesis, and if it returns a value that doesn’t meet our threshold, the hypothesis fails. If we fail to disprove the assertion, we can keep treating roundness as the more important feature. This is “hypothesis testing.”

We created a basic concept of a decision tree. To improve the tree, you would need a better splitting function, likely testing information gain or Gini impurity, and a method to split continuous data (we’ve been working with binary data). Even with these improvements, this algorithm has an inherent flaw: it’s “greedy.” This means that when the decision goes down a certain path, the other ones are ignored; only a few nodes steal all the attention during decision making. A more complex model that uses many variants of the same tree compensates for this. Because it uses many trees, and they’re randomly created, it’s called a “random forest” model. Random forests perform better than many, if not most, classification algorithms, and can even be used for high-quality feature selection.

Yet another problem with decision trees is features with many unique values, such as student numbers or driver’s licences. These usually contribute nothing to classification but do a lot to improperly bias the model. Of course, you likely wouldn’t use such features anyway. For example, common sense tells you that a social security number may be totally useless when determining a person’s sex.

Another problem is “overfitting.” This is when you either use too many features or train your model so much on a particular dataset that it fits too well to that data. Consequently, when it comes across new data, it may perform more poorly than it would if the model was trained less and was more generalized. A simple way to make a model more generalized is by removing the last few nodes, which is called “pruning.”
