Using explicit (predefined) validation set for grid search with sklearn


I have a dataset, which has previously been split into 3 sets: train, validation and test. These sets have to be used as given in order to compare the performance across different algorithms.

I would now like to optimize the parameters of my SVM using the validation set. However, I cannot find how to pass the validation set explicitly to sklearn.grid_search.GridSearchCV(). Below is some code I've previously used for doing K-fold cross-validation on the training set. However, for this problem I need to use the validation set as given. How can I do that?

from sklearn import svm, cross_validation
from sklearn.grid_search import GridSearchCV

# (some code left out to simplify things)

skf = cross_validation.StratifiedKFold(y_train, n_folds=5, shuffle=True)
clf = GridSearchCV(svm.SVC(tol=0.005, cache_size=6000,
                           class_weight=penalty_weights),
                   param_grid=tuned_parameters,
                   n_jobs=2,
                   pre_dispatch="n_jobs",
                   cv=skf,
                   scoring=scorer)
clf.fit(X_train, y_train)

Use PredefinedSplit

from sklearn.model_selection import PredefinedSplit

ps = PredefinedSplit(test_fold=your_test_fold)

then set cv=ps in GridSearchCV

test_fold : "array-like, shape (n_samples,)

test_fold[i] gives the test set fold of sample i. A value of -1 indicates that the corresponding sample is not part of any test set folds, but will instead always be put into the training fold.

Also see here

When using a validation set, set test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples.
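
For example, a minimal sketch assuming X_train/y_train and X_val/y_val are the given splits held as NumPy arrays, and reusing tuned_parameters and scorer from the (elided) code in the question:

import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# -1 keeps a sample in the training fold; 0 puts it in the single validation fold
test_fold = np.concatenate([np.full(len(X_train), -1), np.full(len(X_val), 0)])
ps = PredefinedSplit(test_fold=test_fold)

# The fold indices refer to positions in the combined arrays,
# so the search must be fit on train + validation stacked together.
X_combined = np.concatenate([X_train, X_val])
y_combined = np.concatenate([y_train, y_val])

clf = GridSearchCV(svm.SVC(), param_grid=tuned_parameters, cv=ps, scoring=scorer)
clf.fit(X_combined, y_combined)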


Consider using the hypopt Python package (pip install hypopt), of which I am an author. It's a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out of the box and can also be used with TensorFlow, PyTorch, Caffe2, etc.

# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from sklearn.svm import SVR
from hypopt import GridSearch

param_grid = [
    {'C': [1, 10, 100], 'kernel': ['linear']},
    {'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model=SVR(), param_grid=param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))

EDIT: I (think I) received -1's on this response because I'm suggesting a package that I authored. This is unfortunate, given that the package was created specifically to solve this type of problem.


# Import Libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import PredefinedSplit

# Split Data to Train and Validation
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, stratify=y, random_state=2020)

# Create a list where train data indices are -1 and validation data indices are 0
split_index = [-1 if x in X_train.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)

# Use PredefinedSplit in GridSearchCV
clf = GridSearchCV(estimator = estimator,
                   cv=pds,
                   param_grid=param_grid)

# Fit with all data
clf.fit(X, y)
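
If you want to verify which rows end up in the validation fold, you can inspect the split directly; a quick sanity check using the pds object defined above:

# PredefinedSplit yields exactly one (train, validation) split here,
# because split_index contains only the values -1 and 0
for train_idx, val_idx in pds.split():
    print('train size:', len(train_idx), 'validation size:', len(val_idx))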

Alternatively, you can generate your fold indices accordingly with PredefinedSplit as shown above, or simply loop manually over your parameter grid:


1. Train the model on the training set with one combination of parameters.
2. Get the (validation) accuracy using the validation set (which plays the role of the cross-validation test set).
3. Change the parameters and repeat steps 1 and 2 until you find the parameters leading to the best validation accuracy.
4. Get the (test) accuracy using the test set, which represents the actual expected accuracy of your trained algorithm on new, unseen data.
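
A minimal sketch of that manual loop, assuming X_train/y_train, X_val/y_val and X_test/y_test are the given splits and accuracy is the metric of interest:

from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVC

param_grid = {'C': [1, 10, 100], 'gamma': [0.001, 0.0001]}

best_score, best_params = -1.0, None
for params in ParameterGrid(param_grid):
    model = SVC(**params).fit(X_train, y_train)           # train on the training set
    score = accuracy_score(y_val, model.predict(X_val))   # evaluate on the validation set
    if score > best_score:
        best_score, best_params = score, params

# Refit the best parameters on the training set and report the test accuracy
best_model = SVC(**best_params).fit(X_train, y_train)
print('Test accuracy:', accuracy_score(y_test, best_model.predict(X_test)))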


Comments
  • If we are doing this, should we replace clf.fit(X_train, y_train) with clf.fit(X, y)?
  • @edesz: if refit=True in GridSearchCV, then the OP should know that they cannot use the GridSearchCV instance later to predict, because the last thing the instance does after searching for the optimal params is to refit the best option on (X, y), whereas the intention is actually to refit on (X_train, y_train). A sketch of a workaround is shown after these comments.
  • hypopt is a great module for hyperparameter search. A question: how can I specify the metric in the hyperparameter search? Where do I put 'auc', 'f1', etc.? I posted this question here stackoverflow.com/questions/52912331/… @cgnorthcutt
  • Answered on the post, but in short: upgrade the hypopt package to the latest version (1.0.7) and just use the scoring parameter, e.g. `optimizer.fit(X_train, y_train, params, X_val, y_val, scoring='f1')`. @zesla
  • @cgnorthcutt The scoring parameter for fit function does not work. I am unable to specify scoring = 'f1'.
  • That's unusual. Submit a pull request if so please.
  • See my comment in the accepted answer. You need to be careful not to use clf later to predict.
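
Regarding the refit caveat in these comments, one possible workaround (a sketch, not from the answers above, reusing estimator, pds, param_grid, X and y from the PredefinedSplit example) is to pass refit=False and refit the best parameters on the training portion yourself:

from sklearn.base import clone
from sklearn.model_selection import GridSearchCV

# refit=False stops GridSearchCV from refitting the best candidate on all of (X, y)
clf = GridSearchCV(estimator=estimator, cv=pds, param_grid=param_grid, refit=False)
clf.fit(X, y)  # X, y are the combined train + validation data

# Refit manually on the training split only, then evaluate on the held-out test set
best_model = clone(estimator).set_params(**clf.best_params_)
best_model.fit(X_train, y_train)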