Training and Tuning
Underfitting: the model performs poorly on both the training set and the testing set. We call this type of error an error due to bias: we're simplifying the model too much.
Overfitting: the model does well on the training set but not on the testing set. We call this type of error an error due to variance: we're overcomplicating the model. This applies to both regression and classification.
There is a tradeoff to find between the two.
In the graphs below, solid points are the training set and hollow points are the testing set; each graph shows the number of errors on the training and testing data.
However, we would then be using our testing data to choose our model, and we should never look at the testing data to make that decision. How do we fix this? How do we make a good decision without touching the testing data?
Solution: cross-validation. We set aside a validation set and use it to pick the best model (for example, which algorithm or which polynomial degree). We can also use it to compute metrics such as the F1 score.
Now our graph looks like this:
Usually a model complexity graph will look like this one:
It is always recommended to shuffle our data before doing cross-validation to remove bias.
With shuffling enabled (shuffle=True), it would look like this:
More folds = more computationally expensive.
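In code, shuffled K-fold cross-validation can be set up like this; a minimal sketch with toy data, where the fold count and random_state are arbitrary choices:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data, purely for illustration
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# shuffle=True randomizes the data before splitting it into folds
kf = KFold(n_splits=4, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
```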
Cross-validation is a vital step in evaluating a model. It maximizes the amount of data that is used to train the model, as during the course of training, the model is not only trained, but also tested on all of the available data.
In this exercise, you will practice 5-fold cross-validation on the Gapminder data. By default, scikit-learn's cross_val_score() function uses R^2 as the metric of choice for regression. Since you are performing 5-fold cross-validation, the function will return 5 scores. Your job is to compute these 5 scores and then take their average.
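A sketch of that exercise, assuming the Gapminder CSV is available locally as 'gapminder.csv' with numeric features and a 'life' target column (the filename and column name are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Assumed file and column names; adjust to the actual Gapminder layout
df = pd.read_csv('gapminder.csv')
X = df.drop('life', axis=1).values
y = df['life'].values

reg = LinearRegression()

# cross_val_score uses R^2 by default for regressors; cv=5 returns 5 scores
cv_scores = cross_val_score(reg, X, y, cv=5)

print(cv_scores)
print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))
```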
When applied to other polynomial degree models:
Grid Search
Make a table of all the possible combinations of the hyperparameters, evaluate each one, and pick the best.
Grid Search in sklearn is very simple. We'll illustrate it with an example. Let's say we'd like to train a support vector machine, and we'd like to decide between the following parameters:
kernel: poly or rbf.
C: 0.1, 1, or 10.
(Note: These parameters can be used as a black box now, but we'll see them in detail in the Supervised Learning Section of the nanodegree.)
The steps are the following:
Here we pick what are the parameters we want to choose from, and form a dictionary. In this dictionary, the keys will be the names of the parameters, and the values will be the lists of possible values for each parameter.
We need to decide which metric we'll use to score each of the candidate models. Here, we'll use the F1 score.
Now you can use this estimator best_clf to make the predictions.
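A sketch of the full flow these steps describe, using the SVC example above; the toy dataset and the train/test split are added here just so the snippet runs on its own:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Toy data, purely for illustration
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. The hyperparameters to search over, as a dictionary
parameters = {'kernel': ['poly', 'rbf'], 'C': [0.1, 1, 10]}

# 2. The metric used to score each candidate model (here, F1)
scorer = make_scorer(f1_score)

# 3. Create the GridSearchCV object (with cross-validation built in) and fit it
grid_obj = GridSearchCV(SVC(), parameters, scoring=scorer)
grid_fit = grid_obj.fit(X_train, y_train)

# 4. Retrieve the best estimator and use it to make predictions
best_clf = grid_fit.best_estimator_
predictions = best_clf.predict(X_test)
```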
On the next page, you'll find a lab where you can use GridSearchCV to optimize a decision tree model.
Also:
An important point about grid search: we include cross-validation to avoid overfitting when tuning the hyperparameters.
GridSearchCV can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters. A solution to this is to use RandomizedSearchCV, in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions.
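A sketch of RandomizedSearchCV with a decision tree; the toy data, parameter distributions, and n_iter value are illustrative choices, not anything prescribed by the course:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy data, purely for illustration
X, y = make_classification(n_samples=300, random_state=0)

# Distributions to sample hyperparameter values from
param_dist = {'max_depth': randint(1, 10),
              'min_samples_leaf': randint(1, 20)}

# n_iter controls how many hyperparameter settings are sampled and tried
random_search = RandomizedSearchCV(DecisionTreeClassifier(random_state=0),
                                   param_dist, n_iter=20, cv=5, random_state=0)
random_search.fit(X, y)

print(random_search.best_params_)
print(random_search.best_score_)
```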
Using all data for cross-validation is not ideal.
It is better to split the data into a training set and a hold-out set at the beginning. Then perform grid search with cross-validation on the training set, choose the best hyperparameters, and finally evaluate on the hold-out set.
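A sketch of that workflow with a generic classifier; the estimator, parameter grid, and split sizes are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy data, purely for illustration
X, y = make_classification(n_samples=500, random_state=0)

# 1. Hold out a test set at the very beginning
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 2. Grid search with cross-validation on the training set only
param_grid = {'C': [0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train, y_train)

# 3. Evaluate the chosen hyperparameters on the untouched hold-out set
print("Best params:", grid.best_params_)
print("Hold-out accuracy:", grid.score(X_holdout, y_holdout))
```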
Example 1
Example 2