Training and Tuning

Types of errors you can make when training a model:

Underfitting: Does not do well on either the training set or the testing set. We call this type of error an error due to bias. We're simplifying the model too much.

Overfitting: Does well on the training set but not on the testing set. We call this type of error an error due to variance. This is true for both regression and classification. We overcomplicate the problem.

There is a tradeoff to find between the two (the bias-variance tradeoff).

Model Complexity Graph

In the model complexity graph, solid points represent the training set and hollow points the testing set; the graph below plots the number of errors of each model on the training data and on the testing data.

However, by using the testing data to choose between models, we effectively used it to train the model. We shouldn't be looking at the testing data to make that decision. How do we fix this? How do we make a good decision without touching the testing data?

Solution: cross-validation. We set aside a separate (cross-)validation set and use it to pick the best model (for example, which algorithm and which polynomial degree), computing a metric such as the F1 score on it (a code sketch follows the graphs below).

Now our graph looks like this:

Usually a model complexity graph will look like this one:
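
As a rough sketch (not from the lesson itself), a similar model complexity graph can be produced with scikit-learn's validation_curve, here varying the degree of a polynomial regression pipeline; X and y are assumed to be already loaded:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

degrees = np.arange(1, 11)
model = make_pipeline(PolynomialFeatures(), LinearRegression())

# Cross-validated training and validation scores for each polynomial degree
train_scores, val_scores = validation_curve(
    model, X, y,
    param_name='polynomialfeatures__degree',
    param_range=degrees, cv=5)

# Plot the mean score per degree: the training curve keeps improving,
# while the validation curve peaks and then drops as the model overfits
plt.plot(degrees, train_scores.mean(axis=1), label='training')
plt.plot(degrees, val_scores.mean(axis=1), label='validation')
plt.xlabel('polynomial degree')
plt.ylabel('R^2 score')
plt.legend()
plt.show()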

k-fold cross validation

It is always recommended to shuffle (randomize) the data before doing cross-validation to remove bias.

With shuffling (shuffle=True in scikit-learn), it would look like this:

More folds = a higher computational cost.
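
A minimal sketch of k-fold splitting with shuffling in scikit-learn, assuming X and y are NumPy arrays that are already loaded:

from sklearn.model_selection import KFold

# shuffle=True randomizes the row order before the folds are created
kf = KFold(n_splits=4, shuffle=True, random_state=42)

for train_index, test_index in kf.split(X):
    # each fold takes a turn as the test set; the remaining folds form the training set
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]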

Cross-validation is a vital step in evaluating a model. It maximizes the amount of data that is used to train the model, as during the course of training, the model is not only trained, but also tested on all of the available data.

In this exercise, you will practice 5-fold cross-validation on the Gapminder data. By default, scikit-learn's cross_val_score() function uses R^2 as the metric of choice for regression. Since you are performing 5-fold cross-validation, the function will return 5 scores. Your job is to compute these 5 scores and then take their average.
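
A possible solution sketch, assuming the Gapminder feature array X and target y are already loaded (the linear regression model is an assumption):

# Import the necessary modules
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Create a linear regression object: reg
reg = LinearRegression()

# Compute 5-fold cross-validation scores (R^2 by default for a regressor): cv_scores
cv_scores = cross_val_score(reg, X, y, cv=5)

# Print the five scores and their average
print(cv_scores)
print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))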

Learning Curves

When applied to models of different polynomial degrees, the learning curves show the type of error: for a high-bias (underfitting) model the training and testing curves converge at a high error, while for a high-variance (overfitting) model a large gap remains between the two curves:
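
A hedged sketch of computing learning curves with scikit-learn's learning_curve, assuming X and y are defined (the decision tree estimator here is just a placeholder):

import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

estimator = DecisionTreeClassifier(max_depth=3, random_state=42)

# Train on increasingly large subsets of the data and record
# the training and cross-validation scores at each size
train_sizes, train_scores, val_scores = learning_curve(
    estimator, X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))

# Curves that converge to a low score suggest high bias (underfitting);
# a persistent gap between them suggests high variance (overfitting)
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))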

Grid Search

Make a table of all the possible combinations of the hyperparameter values, evaluate each one, and pick the best.

Grid Search in sklearn

Grid Search in sklearn is very simple. We'll illustrate it with an example. Let's say we'd like to train a support vector machine, and we'd like to decide between the following parameters:

  • kernel: poly or rbf.

  • C: 0.1, 1, or 10.

(Note: These parameters can be used as a black box now, but we'll see them in detail in the Supervised Learning Section of the nanodegree.)

The steps are the following:

1. Import GridSearchCV

from sklearn.model_selection import GridSearchCV

2. Select the parameters:

Here we pick which parameter values we want to choose from and form a dictionary. In this dictionary, the keys will be the names of the parameters, and the values will be the lists of possible values for each parameter.

parameters = {'kernel':['poly', 'rbf'],'C':[0.1, 1, 10]}

3. Create a scorer.

We need to decide which metric we'll use to score each of the candidate models. Here, we'll use the F1 score.

from sklearn.metrics import make_scorer
from sklearn.metrics import f1_score
scorer = make_scorer(f1_score)

4. Create a GridSearchCV object with the parameters and the scorer, and use this object to fit the data.

# Create the classifier (the support vector machine from the example above).
from sklearn.svm import SVC
clf = SVC()
# Create the GridSearchCV object.
grid_obj = GridSearchCV(clf, parameters, scoring=scorer)
# Fit the data.
grid_fit = grid_obj.fit(X, y)

5. Get the best estimator.

best_clf = grid_fit.best_estimator_

Now you can use this estimator best_clf to make the predictions.
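
For example, assuming a held-out test set X_test:

y_pred = best_clf.predict(X_test)

Note that GridSearchCV refits the best estimator on the whole dataset passed to fit() by default (refit=True), so best_clf is ready to use.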

In the next page, you'll find a lab where you can use GridSearchCV to optimize a decision tree model.

Also, an important point about grid search: we include cross-validation to avoid overfitting to the hyperparameters.

# Import necessary modules
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression()

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the data
logreg_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))

Randomized Search CV

GridSearchCV can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters. A solution to this is to use RandomizedSearchCV, in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions.
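
A minimal sketch with a decision tree; the hyperparameter distributions here are only illustrative, and X and y are assumed to be loaded:

# Import necessary modules
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Distributions to sample hyperparameter values from
param_dist = {'max_depth': randint(1, 9),
              'min_samples_leaf': randint(1, 9)}

tree = DecisionTreeClassifier()

# Sample and evaluate a fixed number of settings (n_iter) instead of the full grid
tree_cv = RandomizedSearchCV(tree, param_dist, n_iter=10, cv=5, random_state=42)
tree_cv.fit(X, y)

print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))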

Hold-out set reasoning

Using all of the data for cross-validation is not ideal.

It is better to split the data into a training set and a hold-out set at the beginning, perform grid search cross-validation on the training set, then take the best hyperparameters and evaluate the final model on the hold-out set.

Example 1

# Import necessary modules
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Create the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}

# Instantiate the logistic regression classifier: logreg
# (the liblinear solver supports both the 'l1' and 'l2' penalties in the grid)
logreg = LogisticRegression(solver='liblinear')

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.4,
                                                    random_state = 42)

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv = 5)

# Fit it to the training data
logreg_cv.fit(X_train, y_train)

# Print the optimal parameters and best score
print("Tuned Logistic Regression Parameter: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Accuracy: {}".format(logreg_cv.best_score_))

Example 2

# Import necessary modules
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)

# Create the hyperparameter grid
l1_space = np.linspace(0, 1, 30)
param_grid = {'l1_ratio': l1_space}

# Instantiate the ElasticNet regressor: elastic_net
elastic_net = ElasticNet()

# Setup the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(elastic_net, param_grid, cv=5)

# Fit it to the training data
gm_cv.fit(X_train, y_train)

# Predict on the test set and compute metrics
y_pred = gm_cv.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = gm_cv.score(X_test, y_test)

print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
print("Tuned ElasticNet MSE: {}".format(mse))

Splitting a multiclass dataset (multiple classes we want to predict, not just 0 and 1).
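
A hedged sketch of one common approach, a stratified split, which keeps the class proportions roughly the same in the train and test sets (assuming X and the multiclass labels y are loaded):

from sklearn.model_selection import train_test_split

# stratify=y preserves the class proportions in both splits,
# so no class ends up missing from the training or the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)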
