How to use warm_start

I'd like to use the warm_start parameter to add training data to my random forest classifier. I expected it to be used like this:

clf = RandomForestClassifier(...)
clf.fit(get_data())
clf.fit(get_more_data(), warm_start=True)

But the warm_start parameter is a constructor parameter. So do I do something like this?

clf = RandomForestClassifier()
clf.fit(get_data())
clf = RandomForestClassifier(warm_start=True)
clf.fit(get_more_data())

That makes no sense to me. Won't the new call to the constructor discard previous training data? I think I'm missing something.

The basic pattern (taken from Miriam's answer) of:

clf = RandomForestClassifier(warm_start=True)
clf.fit(get_data())
clf.fit(get_more_data())

would be the correct usage API-wise.

But there is an issue here.

As the docs say:

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.

this means that the only thing warm_start can do for you is add new DecisionTrees. All the previous trees seem to be untouched!

Let's check this in the source code:

n_more_estimators = self.n_estimators - len(self.estimators_)

if n_more_estimators < 0:
    raise ValueError('n_estimators=%d must be larger or equal to '
                     'len(estimators_)=%d when warm_start==True'
                     % (self.n_estimators, len(self.estimators_)))

elif n_more_estimators == 0:
    warn("Warm-start fitting without increasing n_estimators does not "
         "fit new trees.")

This basically tells us that you would need to increase the number of estimators before each new fit!

I have no idea what kind of usage sklearn expects here. I'm not sure if fitting, increasing internal variables and fitting again is correct usage, but I somehow doubt it (especially as n_estimators is normally only set in the constructor).

Your basic approach (in regard to this library and this classifier) is probably not a good idea for out-of-core learning here! I would not pursue this further.

Just to add to @sascha's excellent answer, this hacky method works:

rf = RandomForestClassifier(n_estimators=1, warm_start=True)
rf.fit(X_train, y_train)       # fit the first tree
rf.n_estimators += 1           # ask for one more tree
rf.fit(X_train, y_train)       # only the additional tree is fit

from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target

### RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=10, warm_start=True)
rfc.fit(X[:50], y[:50])
print(rfc.score(X, y))
rfc.n_estimators += 10          # grow the forest by 10 trees on the next batch
rfc.fit(X[50:100], y[50:100])
print(rfc.score(X, y))
rfc.n_estimators += 10
rfc.fit(X[100:150], y[100:150])
print(rfc.score(X, y))
# note: iris is ordered by class, so each slice above contains a single class;
# in practice every warm_start batch should contain samples from each class (see below)

Below is the differentiation between warm_start and partial_fit, as given in the scikit-learn documentation:

When fitting an estimator repeatedly on the same dataset, but for multiple parameter values (such as to find the value maximizing performance as in grid search), it may be possible to reuse aspects of the model learnt from the previous parameter value, saving time. When warm_start is true, the existing fitted model attributes are used to initialise the new model in a subsequent call to fit. Note that this is only applicable for some models and some parameters, and even some orders of parameter values. For example, warm_start may be used when building random forests to add more trees to the forest (increasing n_estimators) but not to reduce their number.

partial_fit also retains the model between calls, but differs: with warm_start the parameters change and the data is (more-or-less) constant across calls to fit; with partial_fit, the mini-batch of data changes and model parameters stay fixed.

There are cases where you want to use warm_start to fit on different, but closely related data. For example, one may initially fit to a subset of the data, then fine-tune the parameter search on the full dataset. For classification, all data in a sequence of warm_start calls to fit must include samples from each class.
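
To make the contrast concrete, here is a minimal sketch (the estimator choices, batch size and shuffling are illustrative assumptions, not taken from the answers above): SGDClassifier exposes partial_fit for genuine mini-batch learning, while for a random forest warm_start only lets you grow the ensemble by raising n_estimators.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
perm = np.random.RandomState(0).permutation(len(y))   # shuffle so every mini-batch contains all classes
X, y = X[perm], y[perm]

# partial_fit: incremental learning, the data changes between calls
sgd = SGDClassifier()
for start in range(0, len(y), 50):
    sgd.partial_fit(X[start:start + 50], y[start:start + 50], classes=np.unique(y))

# warm_start: the data stays the same, only n_estimators changes between calls
rf = RandomForestClassifier(n_estimators=10, warm_start=True)
rf.fit(X, y)
rf.n_estimators += 10
rf.fit(X, y)                     # only the 10 additional trees are fit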

All that warm_start does boils down to preserving the state of the previous fit.


It differs from partial_fit in that the idea is not to incrementally learn on small batches of data, but rather to re-use a trained model in its previous state. Namely, the difference between a regular call to fit and a fit with warm_start=True is that the estimator's state is not cleared, see _clear_state:

if not self.warm_start:
    self._clear_state()

which, among other attributes, would re-initialize all estimators:

if hasattr(self, 'estimators_'):
    self.estimators_ = np.empty((0, 0), dtype=np.object)

So, having set warm_start=True, each subsequent call to fit will not re-initialize the trainable parameters; instead it will start from their previous state and add new estimators to the model.
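
A quick way to see this in practice (a small sketch using the iris data, not part of the original answer) is to check that the trees from the first fit are literally the same objects after a warm-started second fit:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=5, warm_start=True)
rf.fit(X, y)
old_trees = list(rf.estimators_)      # the 5 trees from the first fit

rf.n_estimators = 10
rf.fit(X, y)                          # adds 5 new trees, keeps the old ones

print(len(rf.estimators_))            # 10
print(all(a is b for a, b in zip(old_trees, rf.estimators_[:5])))  # True: old trees untouched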


Which means that one could do:

grid1={'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10]}

from sklearn.model_selection import GridSearchCV

rf_grid_search1 = GridSearchCV(estimator=RandomForestClassifier(),
                               param_grid=grid1,
                               cv=3)
rf_grid_search1.fit(X_train, y_train)

Then fit a model on the best parameters and set warm_start=True:

rf = RandomForestClassifier(**rf_grid_search1.best_params_, warm_start=True)
rf.fit(X_train, y_train)

Then we could perform a grid search only on, say, n_estimators:

grid2 = {'n_estimators': [200, 400, 600, 800, 1000]}
rf_grid_search2 = GridSearchCV(estimator=rf,
                               param_grid=grid2,
                               cv=3)
rf_grid_search2.fit(X_train, y_train)

The advantage here is that the estimators would already be fit with the previous parameter setting, and with each subsequent call to fit the model will start from the previous parameters; we're just analyzing whether adding new estimators would benefit the model.
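
For completeness, a short sketch of how one might read off the outcome of that second search, using standard GridSearchCV attributes (assuming the code above has been run):

print(rf_grid_search2.best_params_)        # e.g. the n_estimators value that scored best
print(rf_grid_search2.best_score_)         # its mean cross-validated score
best_rf = rf_grid_search2.best_estimator_  # refit by GridSearchCV on the full training data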

Comments
  • warm_start is intended to be used on the same data. What is your use case? Do you want to train the classifier in batches, with a small amount of data at a time?
  • @VivekKumar I'd like to incrementally train the classifier. I have a base dataset, and incoming batches of newly created training data (the base set and the new batches have the same shape, I'm not adding extra features or anything like that, just more training data). Now I could re-initialise the model with the base dataset merged with the new batch of data and train on that, but that is too slow. I'd like to 'resume' the training process with the new batch of training data. I hope that makes sense.
  • The only estimators in scikit-learn which support incremental learning are listed at scikit-learn.org/stable/modules/…. RandomForestClassifier is not one of them.
  • I know this is old, but - how do I approach this with a saved model? Is it even possible?
  • Does this work if my X_train, y_train is different from the first fit()? With your current code, if the data in the two fits is different, the classifier does not remember the previous fit(); it just remembers the last fit() data we trained on.
  • Would that automatically do the bootstrap for the new trees too?
  • Interesting answer. Now, let's say you want to use an sklearn model that provides partial_fit, in order to do incremental learning. One issue can be: the number of features for the model can increase over time. And with new features, you might need to retrain the model from scratch, even though you had been calling partial_fit for weeks... However, if you set warm_start=True and combine it with partial_fit, you can change the number of features over time and never need to retrain from zero. Don't you agree?
  • I'm not so sure about that @nolw38, you could give it a go to check. My guess is that it should be possible for certain models? For instance a random forest could be OK with this, since it is only training on a fixed-size subset of features, so the new estimators will be trained on equally sized arrays. Though there might be some check or something that I'm not thinking of. Let me know if you try it out :)