Decision Trees
Last time, you saw this equation for entropy for a bucket with m red balls and n blue balls:

Entropy = -(m/(m+n)) log₂(m/(m+n)) - (n/(m+n)) log₂(n/(m+n))
We can state this in terms of probabilities instead, writing the probability of drawing a red ball as p_1 = m/(m+n) and the probability of drawing a blue ball as p_2 = n/(m+n):

Entropy = -p_1 log₂(p_1) - p_2 log₂(p_2)

More generally, for several outcomes with probabilities p_i, the entropy is -Σᵢ p_i log₂(p_i).
The minimum value is still 0, when all elements are of the same value. The maximum value is still achieved when the outcome probabilities are the same, but the upper limit increases with the number of different outcomes. (For example, you can verify the maximum entropy is 2 if there are four different possibilities, each with probability 0.25.)
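As a quick check, here's a small Python sketch that computes this entropy from a list of outcome probabilities:

```python
import math

def entropy(probabilities):
    """Entropy in bits for a list of outcome probabilities that sum to 1."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Two equally likely outcomes -> 1 bit; four equally likely outcomes -> 2 bits
print(entropy([0.5, 0.5]))                # 1.0
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
```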
Cross-Entropy:
Now, what if each outcome's actual probability is p_i, but someone estimates the probability as q_i? In that case each event still occurs with probability p_i, but the surprisal is computed from q_i (since that person is surprised according to the probabilities they believe, q_i). The weighted average surprisal in this case is nothing but the cross-entropy, and it can be written as:

Cross-Entropy = -Σᵢ p_i log₂(q_i)
Log loss and cross-entropy are slightly different depending on the context, but in machine learning, when calculating error rates between 0 and 1, they resolve to the same thing. As a demonstration, where p and q are the sets p ∈ {y, 1−y} and q ∈ {ŷ, 1−ŷ}, we can rewrite cross-entropy as:

Cross-Entropy = -(y log(ŷ) + (1−y) log(1−ŷ))

where:
p = set of true labels
q = set of predicted probabilities
y = true label
ŷ = predicted probability
Which is exactly the same as log loss!
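To see the equivalence numerically, here's a small sketch comparing the hand-computed cross-entropy with scikit-learn's log_loss. The labels and probabilities below are made up for illustration; both calculations use the natural logarithm, which is what log_loss uses.

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1])               # true labels
y_hat = np.array([0.9, 0.2, 0.7, 0.6])   # predicted probabilities for class 1

# Cross-entropy with p = {y, 1-y} and q = {y_hat, 1-y_hat}, averaged over samples
cross_entropy = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(cross_entropy)
print(log_loss(y, y_hat))  # same value
```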
Note that the child groups are weighted equally in this case, since they're both the same size for all splits. In general, the average entropy of the child groups needs to be a weighted average, based on the number of cases in each child group. That is, for m items in the first child group and n items in the second child group, the information gain is:

Information Gain = Entropy(parent) − [(m/(m+n)) Entropy(child1) + (n/(m+n)) Entropy(child2)]
Entropy(parent) for the first node would be the entropy of all the data.
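For example, here's a small Python sketch of the calculation, using a hypothetical split of a bucket of red and blue balls:

```python
import math

def entropy(group):
    """Entropy in bits of a list of class labels."""
    total = len(group)
    counts = {label: group.count(label) for label in set(group)}
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent, child1, child2):
    """Parent entropy minus the size-weighted average of the child entropies."""
    m, n = len(child1), len(child2)
    weighted_child_entropy = (m * entropy(child1) + n * entropy(child2)) / (m + n)
    return entropy(parent) - weighted_child_entropy

# Hypothetical split of a bucket with 6 red (R) and 4 blue (B) balls
parent = ['R'] * 6 + ['B'] * 4
child1 = ['R'] * 5 + ['B'] * 1
child2 = ['R'] * 1 + ['B'] * 3
print(information_gain(parent, child1, child2))  # about 0.26 bits
```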
In order to create decision trees that generalize well to new problems, we can tune a number of different aspects of the trees. We call the different aspects of a decision tree "hyperparameters". These are some of the most important hyperparameters used in decision trees:
The maximum depth of a decision tree is simply the largest possible length between the root and a leaf. A tree of maximum depth k can have at most 2^k leaves.
A node must have at least min_samples_split samples in order to be large enough to split. If a node has fewer samples than min_samples_split, it will not be split, and the splitting process stops there.

However, min_samples_split doesn't control the minimum size of leaves. As you can see in the example above, the parent node had 20 samples, greater than min_samples_split = 11, so the node was split. But when the node was split, a child node was created that had only 5 samples, fewer than min_samples_split = 11.
When splitting a node, one could run into the problem of having 99 samples in one child and 1 in the other. This doesn't take us very far in our process, and would be a waste of resources and time. If we want to avoid this, we can set a minimum for the number of samples we allow in each leaf.
This number can be specified as an integer or as a float. If it's an integer, it's the minimum number of samples allowed in a leaf. If it's a float, it's the minimum percentage of samples allowed in a leaf. For example, 0.1, or 10%, implies that a particular split will not be allowed if one of the leaves that results contains less than 10% of the samples in the dataset.
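For instance, with scikit-learn's DecisionTreeClassifier (a sketch):

```python
from sklearn.tree import DecisionTreeClassifier

# Integer: every leaf must contain at least 12 samples
model_int = DecisionTreeClassifier(min_samples_leaf=12)

# Float: every leaf must contain at least 10% of the training samples
model_frac = DecisionTreeClassifier(min_samples_leaf=0.1)
```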
If a threshold on a feature results in a leaf that has fewer samples than min_samples_leaf, the algorithm will not allow that split, but it may perform a split on the same feature at a different threshold that does satisfy min_samples_leaf.
In this section, you'll use decision trees to fit a given sample dataset.
Before you do that, let's go over the tools required to build this model.
For your decision tree model, you'll be using scikit-learn's DecisionTreeClassifier class. This class provides the functions to define and fit the model to your data.
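For instance, a minimal version of such an example might look like this (a sketch, with made-up training data standing in for x_values and y_values):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: two features per point, binary labels
x_values = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.5, 0.4]]
y_values = [0, 0, 1, 1]

# Define the decision tree model and fit it to the data
model = DecisionTreeClassifier()
model.fit(x_values, y_values)
```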
In the example above, the model variable is a decision tree model that has been fitted to the data x_values and y_values. Fitting the model means finding the best tree that fits the training data. Let's make two predictions using the model's predict() function.
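Continuing the sketch above:

```python
# predict() takes a 2D array of samples and returns one label per sample
print(model.predict([[0.2, 0.8], [0.5, 0.4]]))
# With the made-up data above, this prints: [0 1]
```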
The model returned an array of predictions, one prediction for each input array. The first input, [0.2, 0.8], got a prediction of 0. The second input, [0.5, 0.4], got a prediction of 1.
When we define the model, we can specify the hyperparameters. In practice, the most common ones are:

max_depth: The maximum number of levels in the tree.
min_samples_leaf: The minimum number of samples allowed in a leaf.
min_samples_split: The minimum number of samples required to split an internal node.
For example, here we define a model where the maximum depth of the tree, max_depth, is 7, and the minimum number of elements in each leaf, min_samples_leaf, is 10.
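In code, that might look like this (a sketch):

```python
from sklearn.tree import DecisionTreeClassifier

# Limit the tree to 7 levels and require at least 10 samples per leaf
model = DecisionTreeClassifier(max_depth=7, min_samples_leaf=10)
```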
In this quiz, you'll be given the following sample dataset, and your goal is to define a model that gives 100% accuracy on it.
The data file can be found under the "data.csv" tab in the quiz below. It includes three columns: the first two are the coordinates of the points, and the third one is the label.
The data will be loaded for you and split into features X and labels y.
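For reference, a sketch of what that loading step might look like (the actual quiz code may differ):

```python
import numpy as np

# data.csv has three columns: x-coordinate, y-coordinate, label
data = np.loadtxt('data.csv', delimiter=',')

X = data[:, :2]  # features: the two coordinates
y = data[:, 2]   # labels
```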
1. Build a decision tree model
Create a decision tree classification model using scikit-learn's DecisionTreeClassifier and assign it to the variable model.
2. Fit the model to the data
You won't need to specify any of the hyperparameters, since the default ones will yield a model that perfectly classifies the training data. However, we encourage you to play with hyperparameters such as max_depth and min_samples_leaf to try to find the simplest possible model.
3. Predict using the model
Predict the labels for the training set, and assign this list to the variable y_pred.
4. Calculate the accuracy of the model
For this, use the sklearn function accuracy_score. A model's accuracy is the fraction of all data points that it correctly classified.
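Putting the four steps together, one possible solution sketch (assuming X and y have been loaded as described above):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Build a decision tree model (try max_depth / min_samples_leaf to simplify it)
model = DecisionTreeClassifier()

# 2. Fit the model to the data
model.fit(X, y)

# 3. Predict labels for the training set
y_pred = model.predict(X)

# 4. Calculate the accuracy of the model
acc = accuracy_score(y, y_pred)
print(acc)
```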
When you hit Test Run, you'll be able to see the boundary region of your model, which will help you tune the hyperparameters, in case you need to adjust them.
Note: This quiz requires you to find an accuracy of 100% on the training set. This is like memorizing the training data! A model designed to have 100% accuracy on training data is unlikely to generalize well to new data. If you pick very large values for your parameters, the model will fit the training set very well, but may not generalize well. Try to find the smallest possible parameters that do the job—then the model will be more likely to generalize well. (This aspect of the exercise won't be graded.)