# Decision Trees


## Multi-class Entropy

Last time, you saw this equation for entropy for a bucket with $m$ red balls and $n$ blue balls:
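$$\text{Entropy} = -\frac{m}{m+n}\log_2\left(\frac{m}{m+n}\right) - \frac{n}{m+n}\log_2\left(\frac{n}{m+n}\right)$$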

We can state this in terms of probabilities instead, writing the probability of drawing a red ball as $p_1 = \frac{m}{m+n}$ and the probability of drawing a blue ball as $p_2 = \frac{n}{m+n}$:
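$$\text{Entropy} = -p_1\log_2(p_1) - p_2\log_2(p_2)$$

More generally, for $k$ different outcomes with probabilities $p_1, \ldots, p_k$, the multi-class entropy is:

$$\text{Entropy} = -\sum_{i=1}^{k} p_i\log_2(p_i)$$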

The minimum value is still 0, when all elements are of the same value. The maximum value is still achieved when the outcome probabilities are the same, but the upper limit increases with the number of different outcomes. (For example, you can verify the maximum entropy is 2 if there are four different possibilities, each with probability 0.25.)
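As a quick check of that claim, here is a minimal sketch (it assumes NumPy, which the original doesn't mention) that computes entropy from a list of probabilities:

```python
import numpy as np

def entropy(probabilities):
    """Entropy in bits: -sum(p_i * log2(p_i)) over the nonzero probabilities."""
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]  # treat 0 * log2(0) as 0
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 -- the maximum for four outcomes
print(entropy([1.0]))                     # 0.0 -- the minimum: all one class
```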

**Cross-Entropy:**

Now, what if each outcome's actual probability is $p_i$, but someone estimates the probability as $q_i$? In that case, each event occurs with probability $p_i$, but its surprisal is computed from $q_i$ (since that person, believing the probability of the outcome to be $q_i$, will be surprised accordingly). The weighted average surprisal in this case is the cross-entropy, and it can be written as:
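$$H(p, q) = -\sum_{i} p_i\log(q_i)$$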

### Log Loss vs Cross-Entropy

Log loss and cross-entropy are slightly different depending on the context, but in machine learning, when calculating error rates between 0 and 1, they resolve to the same thing. As a demonstration, where $p$ and $q$ are the sets $p \in \{y,\ 1-y\}$ and $q \in \{\hat{y},\ 1-\hat{y}\}$, we can rewrite cross-entropy as:
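$$H(p, q) = -\sum_{i} p_i\log(q_i) = -y\log(\hat{y}) - (1-y)\log(1-\hat{y})$$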

- $p$ = set of true labels
- $q$ = set of predictions
- $y$ = true label
- $\hat{y}$ = predicted probability

Which is exactly the same as log loss!
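As a quick sanity check (the data values below are made up), we can compare a hand-computed binary cross-entropy with scikit-learn's `log_loss`:

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1])               # true labels
y_hat = np.array([0.9, 0.2, 0.7, 0.6])   # predicted probabilities

# Binary cross-entropy, averaged over the data points
manual = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(manual)              # both print the same value
print(log_loss(y, y_hat))
```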

### Information Gain

Note that the child groups are weighted equally in this case since they're both the same size, for all splits. In general, the average entropy for the child groups will need to be a *weighted* average, based on the number of cases in each child group. That is, for $m$ items in the first child group and $n$ items in the second child group, the information gain is:
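$$\text{Information Gain} = \text{Entropy(parent)} - \left[\frac{m}{m+n}\,\text{Entropy}(\text{child}_1) + \frac{n}{m+n}\,\text{Entropy}(\text{child}_2)\right]$$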

Entropy(parent) for the first node would be the entropy of all the data.
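For example, here is a sketch that reuses the `entropy` helper defined above (the counts are hypothetical, chosen only for illustration):

```python
def group_entropy(red, blue):
    """Entropy of a group containing `red` red balls and `blue` blue balls."""
    total = red + blue
    return entropy([red / total, blue / total])

# Hypothetical parent with 10 red and 10 blue balls, split into two groups
parent = group_entropy(10, 10)   # 1.0 bit
child_1 = group_entropy(8, 2)    # m = 10 items
child_2 = group_entropy(2, 8)    # n = 10 items

# Children weighted by m/(m+n) and n/(m+n)
info_gain = parent - ((10 / 20) * child_1 + (10 / 20) * child_2)
print(info_gain)  # about 0.278
```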

## Hyperparameters for Decision Trees

In order to create decision trees that will generalize well to new problems, we can tune a number of different aspects of the trees. We call these aspects of a decision tree "hyperparameters". These are some of the most important hyperparameters used in decision trees:

#### Maximum Depth

The maximum depth of a decision tree is simply the largest possible length from the root to a leaf. A tree of maximum depth $k$ can have at most $2^k$ leaves.

#### Minimum number of samples to split

A node must have at least `min_samples_split` samples in order to be large enough to split. If a node has fewer than `min_samples_split` samples, it will not be split, and the splitting process stops.

However, `min_samples_split` doesn't control the minimum size of leaves. As you can see in the example above, the parent node had 20 samples, greater than `min_samples_split = 11`, so the node was split. But the split created a child node with 5 samples, fewer than `min_samples_split = 11`.

#### Minimum number of samples per leaf

When splitting a node, one could run into the problem of having 99 samples in one child and 1 in the other. This would not take us very far in our process and would be a waste of resources and time. To avoid this, we can set a minimum for the number of samples we allow in each leaf.

This number can be specified as an integer or as a float. If it's an integer, it's the minimum number of samples allowed in a leaf. If it's a float, it's the minimum percentage of samples allowed in a leaf. For example, 0.1, or 10%, implies that a particular split will not be allowed if one of the leaves that results contains less than 10% of the samples in the dataset.

If a threshold on a feature results in a leaf that has fewer samples than `min_samples_leaf`, the algorithm will not allow *that* split, but it may perform a split on the same feature at a *different threshold* that *does* satisfy `min_samples_leaf`.
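For instance, a minimal sketch of both forms (the values are arbitrary):

```python
from sklearn.tree import DecisionTreeClassifier

# An integer is an absolute count: every leaf must have at least 10 samples
model_count = DecisionTreeClassifier(min_samples_leaf=10)

# A float is a fraction: every leaf must hold at least 10% of the samples
model_fraction = DecisionTreeClassifier(min_samples_leaf=0.1)
```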

## Decision Trees in sklearn

In this section, you'll use decision trees to fit a given sample dataset.

Before you do that, let's go over the tools required to build this model.

For your decision tree model, you'll be using scikit-learn's `DecisionTreeClassifier` class. This class provides the functions to define and fit the model to your data.
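A minimal sketch of such a model (the training data below is made up for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: two features per point, with a label for each
x_values = [[0.1, 0.9], [0.3, 0.7], [0.6, 0.3], [0.8, 0.2]]
y_values = [0., 0., 1., 1.]

model = DecisionTreeClassifier()
model.fit(x_values, y_values)
```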

In the example above, the `model` variable is a decision tree model that has been fitted to the data `x_values` and `y_values`. Fitting the model means finding the best tree that fits the training data. Let's make two predictions using the model's `predict()` function.
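```python
print(model.predict([[0.2, 0.8], [0.5, 0.4]]))
# [0. 1.]
```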

The model returned an array of predictions, one prediction for each input array. The first input, `[0.2, 0.8]`, got a prediction of `0.`. The second input, `[0.5, 0.4]`, got a prediction of `1.`.

#### Hyperparameters

When we define the model, we can specify the hyperparameters. In practice, the most common ones are:

- `max_depth`: the maximum number of levels in the tree.
- `min_samples_leaf`: the minimum number of samples allowed in a leaf.
- `min_samples_split`: the minimum number of samples required to split an internal node.

For example, here we define a model where the maximum depth of the tree, `max_depth`, is 7, and the minimum number of elements in each leaf, `min_samples_leaf`, is 10.
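```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=7, min_samples_leaf=10)
```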

### Decision Tree Quiz

In this quiz, you'll be given the following sample dataset, and your goal is to define a model that gives 100% accuracy on it.

The data file can be found under the "data.csv" tab in the quiz below. It includes three columns: the first two are the coordinates of the points, and the third is the label.

The data will be loaded for you and split into features `X` and labels `y`.
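Behind the scenes, that loading step might look something like this (a sketch assuming pandas and a headerless `data.csv`):

```python
import numpy as np
import pandas as pd

# Read the data; the first two columns are the coordinates, the third is the label
data = np.asarray(pd.read_csv('data.csv', header=None))
X = data[:, 0:2]
y = data[:, 2]
```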

#### You'll need to complete each of the following steps:

**1. Build a decision tree model**

Create a decision tree classification model using scikit-learn's `DecisionTreeClassifier` and assign it to the variable `model`.

**2. Fit the model to the data**

You won't need to specify any of the hyperparameters, since the default ones will yield a model that perfectly classifies the training data. However, we encourage you to play with hyperparameters such as `max_depth` and `min_samples_leaf` to try to find the *simplest possible* model.

**3. Predict using the model**

Predict the labels for the training set, and assign this list to the variable `y_pred`.

**4. Calculate the accuracy of the model**

For this, use sklearn's `accuracy_score` function. A model's **accuracy** is the fraction of all data points that it correctly classified.

When you hit **Test Run**, you'll be able to see the boundary region of your model, which will help you tune the correct parameters, in case you need them.

*Note:* This quiz requires you to find an accuracy of 100% on the training set. This is like memorizing the training data! A model designed to have 100% accuracy on training data is unlikely to generalize well to new data. If you pick very large values for your parameters, the model will fit the training set very well but may not generalize. Try to find the smallest possible parameters that do the job; then the model will be more likely to generalize well. (This aspect of the exercise won't be graded.)

## Example in Python
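Putting the quiz steps together, a complete sketch might look like this (it assumes the same headerless `data.csv` described above):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the data (first two columns are coordinates, the third is the label)
data = np.asarray(pd.read_csv('data.csv', header=None))
X = data[:, 0:2]
y = data[:, 2]

# 1. Build a decision tree model
model = DecisionTreeClassifier()

# 2. Fit the model to the data
model.fit(X, y)

# 3. Predict the labels for the training set
y_pred = model.predict(X)

# 4. Calculate the accuracy of the model
acc = accuracy_score(y, y_pred)
print(acc)
```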
