In a pipeline, every step except the last must be a transformer. The last step must be an estimator, such as a regressor or a classifier.
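# Import the imputer and the classifier
# (SimpleImputer replaces the Imputer class, which was removed from
# sklearn.preprocessing in scikit-learn 0.22; it always imputes column-wise)
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC

# Set up the imputation transformer: imp
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Instantiate the SVC classifier: clf
clf = SVC()

# Set up the pipeline with the required steps: steps
steps = [('imputation', imp), ('SVM', clf)]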
Example 2:
# Import necessary modules
# (SimpleImputer replaces the Imputer class removed in scikit-learn 0.22)
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Set up the pipeline steps: steps
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
         ('SVM', SVC())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create training and test sets (X and y are assumed to be already loaded)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the training set
pipeline.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = pipeline.predict(X_test)

# Compute metrics
print(classification_report(y_test, y_pred))
Normalizing and standardizing
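Standardizing: subtract the mean and divide by the standard deviation, so that each feature is centered at 0 with variance 1.
Normalizing: subtract the minimum and divide by the range, so that each feature has a minimum of 0 and a maximum of 1.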
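The two rescaling rules above can be checked by hand on a small array. The sketch below is only an illustration: the toy matrix `X` is made up, and the arithmetic is written out with NumPy rather than with scikit-learn's scalers, although `StandardScaler` and `MinMaxScaler` apply the same per-column formulas by default.

```python
import numpy as np

# Hypothetical toy feature matrix: 4 samples, 2 features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# Standardizing: subtract the column mean, divide by the column standard deviation
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalizing (min-max): subtract the column minimum, divide by the column range
X_normalized = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Each standardized column has mean ~0 and standard deviation 1
print(X_standardized.mean(axis=0), X_standardized.std(axis=0))

# Each normalized column runs from 0 to 1
print(X_normalized.min(axis=0), X_normalized.max(axis=0))
```

After either transformation both columns end up on the same scale, which matters for distance-based or margin-based models such as the SVC used in the pipelines here.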
Pipeline for classification
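# Import necessary modules
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Set up the pipeline
steps = [('scaler', StandardScaler()), ('SVM', SVC())]
pipeline = Pipeline(steps)

# Specify the hyperparameter space (keys follow the pattern stepname__parameter)
parameters = {'SVM__C': [1, 10, 100], 'SVM__gamma': [0.1, 0.01]}

# Create training and test sets (X and y are assumed to be already loaded)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

# Instantiate the GridSearchCV object: cv
cv = GridSearchCV(pipeline, parameters, cv=3)

# Fit to the training set
cv.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)

# Compute and print metrics
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))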
Pipeline for regression
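# Import necessary modules
# (SimpleImputer replaces the Imputer class removed in scikit-learn 0.22)
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Set up the pipeline steps: steps
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
         ('scaler', StandardScaler()),
         ('elasticnet', ElasticNet())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'elasticnet__l1_ratio': np.linspace(0, 1, 30)}

# Create training and test sets (X and y are assumed to be already loaded)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Create the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(pipeline, parameters, cv=3)

# Fit to the training set
gm_cv.fit(X_train, y_train)

# Compute and print the metrics (only l1_ratio is tuned here, not alpha)
r2 = gm_cv.score(X_test, y_test)
print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))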