Lessons 3 & 4
Using RF to better understand our data.
Random stats info
Correlation is between 2 variables; R^2 is between the dependent variable and an independent variable.
In hierarchical clustering, we want to find similar columns in the way an RF would find them similar. RFs don't care about linearity, they care about ordering, so a rank correlation is used. He uses Spearman's R, the most common rank correlation.
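A minimal sketch of that clustering step, assuming `df` is a DataFrame of the independent variables (the name is a placeholder):

```python
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as hc
from scipy.spatial.distance import squareform

# pairwise Spearman (rank) correlation between all columns of df
corr = np.round(scipy.stats.spearmanr(df).correlation, 4)

# turn correlation into a distance: identical rank order -> distance 0
corr_condensed = squareform(1 - corr)
z = hc.linkage(corr_condensed, method='average')

# columns that merge low in the dendrogram are near-duplicates as far as an RF cares
plt.figure(figsize=(12, 8))
hc.dendrogram(z, labels=df.columns, orientation='left')
plt.show()
```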
Check graphs to identify whether there is an interaction between 2 independent vars: plot the 2 independent variables against the dependent variable. Then create these interactions as new cols.
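For example, a sketch of adding interaction columns in pandas (column names `a` and `b` are hypothetical):

```python
# if the plots suggest a and b interact, add the interactions as new columns
df['a_times_b'] = df['a'] * df['b']
df['a_minus_b'] = df['a'] - df['b']
```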
Tips for not overfitting
We want each tree to be as predictive as possible
We want them to be as independent as possible (recent research shows that this is more important)
When we use set_rf_samples, the smaller the sample, the more we decrease each tree's predictive power, but the more independent we make the trees.
So if we're using set_rf_samples, we should increase the number of trees we train. We need more trees to compensate for each tree being slightly less predictive.
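A sketch of that trade-off, assuming the old fastai 0.7 library used in the course (where `set_rf_samples` patches sklearn's bootstrap sampling) and placeholder data names:

```python
from sklearn.ensemble import RandomForestRegressor
from fastai.structured import set_rf_samples  # old fastai 0.7 helper

set_rf_samples(20_000)  # each tree now sees a random 20k-row sample

# weaker individual trees, so compensate with more of them
m = RandomForestRegressor(n_estimators=80, n_jobs=-1)
m.fit(X_train, y_train)
```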
Testing your test set
Once you've finished training and figured out your hyperparameters with the validation set, you can retrain on both train + validation and then score the test set. Make sure you use the best parameters that worked on the validation set.
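A minimal sketch (array names and hyperparameter values are assumptions from earlier steps):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# refit on train + validation using the hyperparameters tuned on the validation set
X_full = np.concatenate([X_train, X_valid])
y_full = np.concatenate([y_train, y_valid])

m = RandomForestRegressor(n_estimators=160, min_samples_leaf=3, n_jobs=-1)
m.fit(X_full, y_full)

print(m.score(X_test, y_test))  # only look at this once, at the very end
```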
When to use Random Forests
It's almost always beneficial to try a random forest on a dataset. The question is: when should I also use something else?
For unstructured data, like the waveform of a sound, the pixels in an image, or the words in a text, use deep learning.
RF Weaknesses
An RF is not able to extrapolate to data it hasn't seen before, such as future time periods in time-series data. The only thing an RF does is average stuff it has already seen. And when there is too much complexity, or variables with super high cardinality, it fails to really capture all the complexity.
One way to deal with this: use a neural net. Or use the usual time-series techniques to fit some kind of trend and de-trend the data, then use the RF to predict the de-trended part. Or use a gradient boosting machine, which handles this nicely: GBMs can't extrapolate to the future either, but they can deal with time-dependent data more conveniently.
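A sketch of the de-trending idea, assuming `t` is a numeric time index and the other names are placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# fit a simple linear trend over time and remove it from the target
trend = np.poly1d(np.polyfit(t, y, deg=1))
y_detrended = y - trend(t)

# the RF only has to model what it can average: the residuals
m = RandomForestRegressor(n_estimators=40, n_jobs=-1)
m.fit(X, y_detrended)

# at prediction time, add the extrapolated trend back
y_pred = trend(t_future) + m.predict(X_future)
```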
Conclusion of removing the least important cols
If removing these cols makes it worse, they were important after all. So we're hoping that we're simplifying the model and maybe even making it a little bit better.
Reason: if we think about what the trees are doing, with these cols removed each tree worries less about what to split on, less often accidentally finds a bad col, etc.
We're also (possibly) removing collinearity. Removing some of the cols with little impact makes our feature importance clearer, because some of the importance might have been split between 2 correlated cols; once one of them is removed, the importance is no longer diluted.
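A sketch of the pruning step, assuming `fi` is a feature-importance DataFrame with columns `cols` and `imp`, and 0.005 is an arbitrary threshold:

```python
# keep only the columns whose importance is above the threshold
to_keep = fi[fi.imp > 0.005].cols
df_keep = df[to_keep].copy()
# then retrain on df_keep and check the score didn't get worse
```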
How is feature importance calculated?
We take our tree and our dataset. Column by column, we take the values and shuffle them. This way the column is 'broken' and no longer has any correlation with our prediction. We then run a prediction, compute the R^2 and compare the score to what we got on the intact dataset. Maybe we go from R^2 = 0.90 to R^2 = 0.80. That would mean the col is super important.
Why not just remove the col and fit another RF? Because we'd have to train a totally new model, whereas with this technique we're just running predictions; the model is already trained.
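A minimal sketch of that shuffling trick, assuming `X_valid` is a DataFrame and `m` a fitted forest (modern sklearn also ships this as `sklearn.inspection.permutation_importance`):

```python
import numpy as np

def shuffle_importance(m, X_valid, y_valid, col):
    """Importance of `col` = drop in R^2 after shuffling it. No retraining needed."""
    baseline = m.score(X_valid, y_valid)        # R^2 on the intact data
    X_shuf = X_valid.copy()
    X_shuf[col] = np.random.permutation(X_shuf[col].values)  # 'break' the column
    return baseline - m.score(X_shuf, y_valid)  # e.g. 0.90 - 0.80 = 0.10
```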
Why is OOB score slightly lower than a validation set score?
Because with OOB, a row can only be scored by the trees that did not have it in their bootstrap sample (that's what 'out-of-bag' means), and every tree uses only a sample of the training set. Each row will be covered (with enough trees), but only a subset of the trees contributes to its prediction. With a validation set, every tree makes a prediction for every row, and more trees = better prediction.
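A minimal sketch of getting the OOB score from sklearn (data names are placeholders):

```python
from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_estimators=40, oob_score=True, n_jobs=-1)
m.fit(X_train, y_train)

# R^2 where each row is predicted only by the trees that never saw it
print(m.oob_score_)
```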
train_test_split() in sklearn
The fact that it gives you a random sample means that a lot of the time you shouldn't be using it.
The fact that there is an OOB score with random forests means that it is useful, but if we're dealing with time-series data we know it only tells us we generalize in a statistical sense, not in a practical sense, because the OOB won't test future predictions, only random samples of the training set.
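For time-series data, a sketch of the alternative: split by position, not at random (the size is arbitrary):

```python
# assume df is sorted by date; hold out the most recent rows as validation
n_valid = 12_000
df_train, df_valid = df[:-n_valid], df[-n_valid:]
```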
Cross-validation
It's a lot like an RF: each of the k models trains on a subset, (k-1)/k of the data, and across the folds we end up using all of the data (see the sketch after the downsides list below).
Downsides:
Time - you need to train k models instead of 1.
We had concerns about random validation sets being a problem, and that's valid here too: these k validation sets are random. CV doesn't work for temporal data; there's no good way to do CV on it.
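For reference, a sketch of plain k-fold CV in sklearn (fine for non-temporal data, subject to the caveats above; `X` and `y` are placeholders):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

m = RandomForestRegressor(n_estimators=40, n_jobs=-1)

# trains k=5 models, each validated on a different (random) fifth of the data
scores = cross_val_score(m, X, y, cv=5)
print(scores.mean(), scores.std())
```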
Contributions section
For each row, you can see how much each feature contributed to the prediction of the target variable.
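A sketch using the treeinterpreter library the course relies on, assuming `m` is a fitted forest and `row` a one-row DataFrame:

```python
from treeinterpreter import treeinterpreter as ti

# prediction = bias (training-set mean) + sum of per-feature contributions
prediction, bias, contributions = ti.predict(m, row.values)

# one signed contribution per feature for this row
for col, contrib in sorted(zip(row.columns, contributions[0]), key=lambda t: t[1]):
    print(col, contrib)
```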
Do we need to normalize when training a random forest?
When we're deciding where to split, all that matters is how the values are sorted. So if we normalize, they are still sorted in the same order. When we implemented the RF, we sorted the values and then completely ignored them. RFs only care about the sort order of the independent variables.
This is why they are immune to outliers in the independent variables: they totally ignore the values and only care about which is higher. The same thing occurs in some metrics too:
Ex: AUC - this completely ignores scale and only cares about sort order.
Same with the dendrogram: Spearman's correlation only cares about order, not scale.
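A tiny demo of that scale-blindness (all values here are made up): Spearman's correlation is unchanged by any monotonic rescaling, while Pearson's is not:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = a ** 3 + rng.normal(scale=0.1, size=100)  # monotonic but non-linear

print(pearsonr(a, b)[0])                     # sensitive to the cubic scale
print(spearmanr(a, b).correlation)           # ~1: it only sees the ordering
print(spearmanr(a, np.tanh(b)).correlation)  # identical: ranks didn't change
```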