Class 1 & 2 - Random Forests

Lesson 1

Jupyter notebook shortcuts

  • ? : documentation about the function

  • ?? : the actual source code

  • shift+tab : parameters

  • shift+tab x2: source code

  • shift+tab x3: opens the documentation in a pane at the bottom of the window

  • tab : autocomplete with available variables and functions

  • !ls : prefix any terminal command with !

  • !ls {python_variable} : you can put any Python expression between the braces

Saving your progress with feather

  • os.makedirs('tmp', exist_ok=True)

  • df.to_feather('tmp/notebook_name')

Read feather format

  • df = pd.read_feather('tmp/notebook_name')
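A minimal end-to-end sketch of the feather workflow (the DataFrame and the 'tmp/notebook_name' path are just placeholders; note that to_feather needs pyarrow installed and a default RangeIndex, so reset the index first if you've filtered or sorted rows):

import os
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})   # placeholder data

os.makedirs('tmp', exist_ok=True)                            # create the folder if needed
df.reset_index(drop=True).to_feather('tmp/notebook_name')    # save; dtypes are preserved

df = pd.read_feather('tmp/notebook_name')                    # reload later, fast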

Python syntax

# Print
name = 'julien'
f'Hello {name.upper()}'  # -> 'Hello JULIEN'

# DF
df.head().transpose() - transpose the df (swap rows and columns)

# CSV - parse_dates to convert a column to datetime
pd.read_csv('filename', parse_dates=['col_name'])

# time
%time # will tell you how long things take

# np.stack()
joins a sequence of arrays along a new axis.

# re - regular expressions

# Get a random sample of indices - good trick to know (np.random.choice(len(y), n_sample, replace=False) works too)
np.random.permutation(len(y))[:n_sample]  # passing an int n to permutation permutes range(n)

# Create a dataset:
# Create some evenly spaced data between start and stop:
x = np.linspace(0, 1)  # by default num=50

# y
y = np.random.uniform(-0.2, 0.2, x.shape)  # pass uniform the shape we want: x.shape

# visualize
plt.scatter(x,y)

# Change a vector of shape (50,) into a matrix of shape (50,1), using the x above
x.shape       # (50,)
x[:, None]    # (50,1) - adds a new axis in the second position (x[None] adds it in front, giving (1,50))

x[..., None]  # adds a dimension at the end, whatever the number of existing dimensions

Machine learning tips

datetime objects: you can split them into multiple columns because there is a lot of info in there: year, month, week, weekday, weekend, start of month, etc. Use pandas' .dt accessor on a datetime column to get this extra info, e.g. df.datecol.dt.week.
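A minimal sketch of pulling extra feature columns out of a datetime column with the .dt accessor (the column names are made up; fast.ai's add_datepart helper does a fuller version of this):

import pandas as pd

df = pd.DataFrame({'saledate': pd.to_datetime(['2023-01-07', '2023-06-15', '2023-12-31'])})

df['saleYear']           = df.saledate.dt.year
df['saleMonth']          = df.saledate.dt.month
df['saleDayofweek']      = df.saledate.dt.dayofweek                      # 0 = Monday
df['saleIs_weekend']     = (df.saledate.dt.dayofweek >= 5).astype(int)
df['saleIs_month_start'] = df.saledate.dt.is_month_start.astype(int)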

categorical columns: access category attributes with df.col.cat.attribute

Alternative to one-hot encoding (works for random forests; it might not work for linear models): use pandas' category dtype, then replace the category values with their category codes. In fast.ai that is proc_df(df, col), which uses numericalize(df, ...): if the column is not numeric, df[name] = col.cat.codes + 1 (the +1 is so values start at 0, because cat.codes uses -1 for missing values).
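A rough sketch of what that numericalize step boils down to (this is the idea, not fast.ai's actual code):

import pandas as pd

df = pd.DataFrame({'UsageBand': ['High', 'Low', None, 'Medium', 'Low']})

df['UsageBand'] = df['UsageBand'].astype('category')   # pandas category dtype
print(df['UsageBand'].cat.categories)                  # Index(['High', 'Low', 'Medium'], ...)
print(df['UsageBand'].cat.codes.values)                # [ 0  1 -1  2  1]  (-1 = missing)

df['UsageBand'] = df['UsageBand'].cat.codes + 1        # shift so missing becomes 0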

How to reorder a category

df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'],
                                    ordered=True, inplace=True)
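Note: recent pandas versions removed the inplace argument from set_categories, so if that call errors, assign the result back instead (same effect):

df_raw['UsageBand'] = df_raw['UsageBand'].cat.set_categories(
    ['High', 'Medium', 'Low'], ordered=True)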

Some Proba definitions to answer how big our validation set should be

If we have a binomial distribution with n and p, the mean is n*p.

std dev : sqrt(n * p * (1-p))  (n * p * (1-p) is the variance)

std error : if we run a bunch of trials, each time getting a mean, the standard error is the std dev of those means: stddev / sqrt(n). We can use this for the score (e.g. accuracy) of our validation set: if we trained many times, what would the standard error of the validation score be?

So we can train our model 5 times, with the same hyper params, and check the std dev.

So the size of the validation set will depend on how common the least probable class is, and on our accuracy.
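A quick sketch of that idea (synthetic data and hyperparameters are arbitrary; only the random seed changes between runs, so the spread here comes from the model's own randomness):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, noise=10, random_state=0)
X_trn, X_val, y_trn, y_val = train_test_split(X, y, random_state=0)

scores = []
for seed in range(5):                                   # same hyperparams, 5 runs
    m = RandomForestRegressor(n_estimators=20, random_state=seed, n_jobs=-1)
    m.fit(X_trn, y_trn)
    scores.append(m.score(X_val, y_val))                # R^2 on the validation set

scores = np.array(scores)
print(scores.mean(), scores.std())                      # spread of the score across runs
print(scores.std() / np.sqrt(len(scores)))              # standard error of the mean score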

R^2

It can take any value up to 1. Less than 0 means your model is worse than just always predicting the mean.

Instead of remembering the formula, remember the meaning of it: it's the ratio between how good your model is vs. how good a model that just predicts the mean is.
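That ratio view is literally the formula: R^2 = 1 - SS_res / SS_tot, where SS_tot is the error of the model that always predicts the mean. A minimal check against scikit-learn (toy numbers):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

ss_res = ((y_true - y_pred) ** 2).sum()         # how bad our model is
ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # how bad the "predict the mean" model is
print(1 - ss_res / ss_tot)
print(r2_score(y_true, y_pred))                 # same number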

Difference between Validation set VS test set

If we use just one holdout set for hyperparameter tuning (the validation set), we may end up overfitting the hyperparameters to that set - we could simply have gotten lucky with that validation set and gotten a good result. So we want a second holdout set (the test set) where we can say "OK, I'm finished", and just once, right at the end, check whether the model actually works.

We really have to remove the test set from the data.

Time and validation sets

When we are dealing with time series, we want to split our dataset so that the validation and test sets do not cover the same dates as the training set. We want to know how the model will do on data and dates it has never seen before: the training set represents one date range, and the validation set a later one.

This is why, in the notebook, we don't take a random sample of the data: we wouldn't be getting the most recent dates for our validation set, we'd get random dates.
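A minimal sketch of such a date-based split (the column name, sizes and split_vals helper are assumptions, mirroring the notebook's approach): sort by date and hold out the most recent rows instead of sampling randomly.

import pandas as pd

df = pd.DataFrame({
    'saledate': pd.date_range('2020-01-01', periods=100, freq='D'),
    'price': range(100),
}).sample(frac=1, random_state=0)          # shuffled, like raw data often is

def split_vals(df, n_trn):
    # first n_trn rows for training, the rest (the most recent dates) for validation
    return df.iloc[:n_trn].copy(), df.iloc[n_trn:].copy()

df = df.sort_values('saledate')            # oldest first
n_valid = 20                               # validation size (arbitrary here)
train_df, valid_df = split_vals(df, len(df) - n_valid)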

train_cats(df) function

Important to note: this creates the categories from the training dataset. Each category name is mapped to an integer behind the scenes. We can then use apply_cats on our validation or test dataset to get the same mapping and ordering. Very important to remember.

def train_cats(df):
    for n,c in df.items():
        # 'O' = object dtype (strings): turn those columns into ordered categoricals
        if c.dtype == 'O': df[n] = c.astype('category').cat.as_ordered()

When we then take the category codes, the NaNs automatically turn into -1.
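A sketch of why this matters (apply_cats is fast.ai; roughly the same thing in plain pandas is to reuse the training set's categories on the validation set so the codes line up, with NaNs and unseen values becoming -1):

import pandas as pd

trn = pd.DataFrame({'UsageBand': ['High', 'Low', 'Medium', 'Low']})
val = pd.DataFrame({'UsageBand': ['Low', 'High', None, 'Unknown']})

trn['UsageBand'] = trn['UsageBand'].astype('category').cat.as_ordered()

# reuse the *training* categories on the validation set (roughly what apply_cats does)
val['UsageBand'] = pd.Categorical(val['UsageBand'],
                                  categories=trn['UsageBand'].cat.categories,
                                  ordered=True)

print(trn['UsageBand'].cat.codes.values)   # [ 0  1  2  1]   High=0, Low=1, Medium=2
print(val['UsageBand'].cat.codes.values)   # [ 1  0 -1 -1]   NaN and unseen values -> -1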

Validation set and time based dataset

In general, if you are dealing with a time series dataset, you want your validation and test sets to come from different (later) time periods; otherwise you aren't really testing your ability to predict the future. That is why, when we do the train/valid split, we aren't picking random samples for these datasets.

Random Forests

Downside: They do not extrapolate at all outside the range that we've seen in training.

Tree (each tree in the forest): chooses the variable that splits the data best, along with a split point. To do this it goes through every single variable, as well as the different values of that variable, to find the best split point. You know which is best by looking at a value like the MSE, which tells you the average error, so you have a benchmark to compare against. (We could have used something other than MSE, but in practice MSE works very well.)

We have a single number that tells you how good a split is: the weighted average of the MSE of the two groups it creates.
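A tiny sketch of scoring one candidate split this way (toy data; a real tree tries every variable and every split value and keeps the lowest score):

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([10, 12, 11, 30, 32, 31], dtype=float)

def split_score(x, y, split_value):
    lhs, rhs = y[x <= split_value], y[x > split_value]
    mse = lambda v: ((v - v.mean()) ** 2).mean()
    # weighted average of the MSE of the two groups the split creates
    return (len(lhs) * mse(lhs) + len(rhs) * mse(rhs)) / len(y)

print(split_score(x, y, 3))   # good split -> low score
print(split_score(x, y, 1))   # bad split  -> much higher score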

Key insight: random forests use a technique called bagging. We create multiple trees which, individually, are crappy models, but taken together they form a good model. This works because the trees are independent, and so their errors are independent from each other too.

Of course you want each estimator (or tree) to be as predictive as possible too. You want predictive but poorly correlated trees.

Trees are independent when they use different data points. If I build 1000 trees, each with only 10 data points, they will be very different, and so their errors will be too. If I build 1000 trees but each with n-1 data points (all but one), they are going to be very similar, so they won't be independent and their errors will be correlated - although each individual tree will be a better predictor. There is a balance between the two that we need; the latter won't generalize very well.

Recent research has shown that the most important thing is to have uncorrelated trees, rather than each tree being more predictive. Also, if you have less good trees, you just need more of them.

For example, in scikit-learn there is a class called ExtraTreesClassifier (and ExtraTreesRegressor). Rather than trying every split of every variable, it picks splits more randomly. Each tree is less predictive, but because it's faster to build, you can make more trees and end up with a model that generalizes better.

Random Forest - Code

# each tree is stored in this attribute called 
estimators_

# example: get an array of each tree's predictions
preds = np.stack([t.predict(X_valid) for t in m.estimators_])
# returns an array of shape (n_trees, n_valid_rows)
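The point of stacking the per-tree predictions: for a regressor, the forest's prediction is just their mean, so you can watch the score improve as trees are added. A hedged, self-contained sketch (synthetic data; the names mirror the snippet above):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=8, noise=5, random_state=0)
X_trn, X_valid, y_trn, y_valid = train_test_split(X, y, random_state=0)

m = RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=0)
m.fit(X_trn, y_trn)

preds = np.stack([t.predict(X_valid) for t in m.estimators_])   # (n_trees, n_valid_rows)

print(np.allclose(preds.mean(axis=0), m.predict(X_valid)))      # True: forest = mean of trees

for i in range(1, len(preds) + 1):                              # R^2 as we average more trees
    print(i, r2_score(y_valid, preds[:i].mean(axis=0)))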

Sampling

When exploring our data and experimenting with ML and RFs, we should use samples to understand feature importances and dependencies. This way we can do our analysis interactively.

DO NOT use all of your data to search for your best parameters all the time - you are going to waste time.
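One simple way to do that (df_raw and the sample size are placeholders; fast.ai 0.7 also has set_rf_samples, which makes each tree train on a subsample instead):

import numpy as np
import pandas as pd

df_raw = pd.DataFrame(np.random.randn(100_000, 5), columns=list('abcde'))   # stand-in for a big dataset

df_sample = df_raw.sample(n=20_000, random_state=42)   # iterate quickly on this
# ... explore, tune hyperparameters, check feature importances on df_sample ...
# then rerun the final model on the full df_raw once the setup looks good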

Out-of-Bag (OOB) and oob_score

Concept: what if you don't have enough data to have a validation set? Each tree is trained on only a sample of the data, so for the first tree you can take the data points that were not used to build it and pass them through that tree, as if they were a validation set. Do this for all trees: we get a different validation set for each tree.

Those data points are "out of bag". If you have enough trees, every row in your sample will be out of bag for at least one tree, so you can make an out-of-bag prediction for each row and calculate MSE, R^2, etc.

We can pass oob_score=True to scikit-learn. The score is actually R^2 for the OOB sample.

Key insight: the OOB score is great at telling you which model performed best when you do a grid search.
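In scikit-learn this is just a constructor flag; after fitting, the R^2 computed from each row's out-of-bag trees is available as oob_score_ (small sketch with synthetic data):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=2000, n_features=10, noise=10, random_state=0)

m = RandomForestRegressor(n_estimators=40, oob_score=True, n_jobs=-1, random_state=0)
m.fit(X, y)

print(m.oob_score_)   # R^2 estimated from the out-of-bag samples - no validation set needed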

Exploring our data with Random Forests

In practice, for finding parameters, the best estimators, etc., it's easier to do this with a sample of the data and with fewer trees. Once you've found everything, you can run it on the whole dataset. You can run things in seconds instead of hours. (Most people run all of their models and params on all of their data - and waste a lot of time.)

This allows you to interactively do your analysis.

Other Hyperparams

min_samples_leaf: by default leaves end up with n = 1. If you set n = 5, for example, each leaf prediction is an average of (at least) 5 data points. The model can generalize better, but each tree will predict a little less well.

Good values for min_samples_leaf are usually 1, 3, 5, 10, 20, 25.

max_features: in addition to taking a subset of rows (bagging), we take a different subset of columns at each split point. It's like row sampling, but for columns. Why? Maybe a certain column is always far more predictive than all the others, so every tree would end up with that feature as its best predictor. But sometimes there are interactions in the data that are better predictors, and max_features lets the trees find them.

Default = 1.0 (use all of them). You can pass 0.5 to use half of them. In practice good values are 1.0, 0.5, 'sqrt', and 'log2'.
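Putting those hyperparameters together (the values are just the "usually good" ones mentioned above, not a recommendation for any particular dataset):

from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(
    n_estimators=40,       # more trees rarely hurts, just slower
    min_samples_leaf=3,    # each leaf averages at least 3 rows -> smoother, more general
    max_features=0.5,      # each split only considers half the columns
    oob_score=True,
    n_jobs=-1,
)
# m.fit(X_train, y_train)  # X_train / y_train assumed to exist already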

No need to scale values in Random Forests

Because random forests only care about the sort order of the values, we don't care about the values themselves at all. But for linear models, and things that are built out of linear models like neural nets, we do need to scale the values.

Encoding categorical variables: ordinal VS one-hot encoding for Random Forests

You don't need to one-hot encode with a random forest; you can simply use ordinal category codes. The reason is that, given a categorical variable, the tree can split its codes up any way it wants, isolating any level or group of levels through successive splits.

The reason you might still use one-hot encoding is to turn each level of a categorical variable into its own feature, so you can measure feature importance per level. But if you one-hot encode features with high cardinality (say 50+ levels), that actually makes the RF worse: the matrix gets too sparse, and that many one-hot columns isn't practically useful either, since you end up with too many feature-importance entries to look at.

For NNs, linear models, or logistic regression, you do need to one-hot encode, because with an ordinal encoding the best a linear model can do is fit a single line through the arbitrary ordering of the category codes.

Which is crap. So ordinal encoding won't be useful for a linear model. Instead what we do is one-hot encoding. With OHE we're effectively creating something like a histogram, where we can have a different coefficient for each level.

But by creating an embedding, you can make it take less space and compute. It never gets too tedious to do this because, mathematically, instead of building the matrix with all the OHE variables, an equivalent operation is to keep the coefficients and do an index lookup.

Also, if we try to solve something like this analytically, it all falls apart, because there is 100% collinearity between the 4th city and the rest of the cities: the row is Sydney either when the Sydney column equals 1, or, equivalently, when the other three columns are all 0.

With stochastic gradient descent it's not really a problem; we're just taking a step along the derivative. It matters a little, because the main problem with collinearity is that there's an infinite number of equally good solutions: we could decrease the coefficients of the first three categories and increase the fourth, or vice versa, and anything in between, and it would all balance out. An infinitely large number of good solutions means there are a lot of flat spots in the loss function. So if we add a bit of regularization (weight decay), we get the one solution that is best: the one where the parameters are the smallest and the most similar to each other.

Embeddings

We have a matrix of words (basically one-hot encoded). But that's not how we store it in memory, because in real life the matrix would have ~200k columns (one per word), so we only store the words that are present, not the whole matrix - i.e. we only store the indexes of the words. This is a sparse way of storing the matrix: just list out the indexes.

r represents the coefficients of the words. We matrix-multiply the two.

In general, multiplying a matrix by a one-hot encoded vector is equivalent to looking up the nth row of that matrix. So it's like saying: find the 0th, 1st, 2nd, or 3rd row of coefficients.

This computation trick is called an embedding.
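A tiny numpy demonstration of the trick: multiplying a one-hot vector by a coefficient matrix gives exactly the same answer as indexing into it.

import numpy as np

vocab_size, emb_dim = 5, 3
coeffs = np.random.randn(vocab_size, emb_dim)   # one row of coefficients per word

word_idx = 2
one_hot = np.zeros(vocab_size)
one_hot[word_idx] = 1.0

print(np.allclose(one_hot @ coeffs, coeffs[word_idx]))   # True: matmul == row lookup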

Statistical significance for feature importance and all the rest

These days we have so much data that we don't think about statistical significance that much. We would if we had a really small dataset; in that case we can bootstrap our data and do multiple runs to see the variation.
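A sketch of the bootstrap idea (synthetic "small" dataset; the numbers are arbitrary): resample with replacement many times, refit, and look at how much the score varies.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

scores = []
for i in range(20):
    idx = np.random.choice(len(X), len(X), replace=True)   # bootstrap resample
    m = RandomForestRegressor(n_estimators=40, oob_score=True, n_jobs=-1, random_state=i)
    m.fit(X[idx], y[idx])
    scores.append(m.oob_score_)

print(np.mean(scores), np.std(scores))   # how much the score varies across resamples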
