Predicting how long a scikit-learn classification will take to run

Is there a way to predict how long it will take to run a classifier from scikit-learn based on the parameters and dataset? I know, pretty meta, right?

Some classifiers/parameter combinations are quite fast, and some take so long that I eventually just kill the process. I'd like a way to estimate in advance how long it will take.

Alternatively, I'd accept some pointers on how to set common parameters to reduce the run time.

Some classifiers and regressors directly report the remaining time or the progress of the algorithm (number of iterations, etc.). In most cases this can be turned on by passing verbose=2 (or any number greater than 1) to the constructor of the individual model. Note: this behavior is as of sklearn-0.14. Earlier versions have slightly different verbose output (still useful, though).

The best examples of this are ensemble.RandomForestClassifier and ensemble.GradientBoostingClassifier, which print the number of trees built so far and, in the case of GradientBoostingClassifier, the remaining time.

from sklearn import ensemble
clf = ensemble.GradientBoostingClassifier(verbose=3)
clf.fit(X, y)
Out:
   Iter       Train Loss   Remaining Time
     1           0.0769            0.10s
     ...

Or

clf = ensemble.RandomForestClassifier(verbose=3)
clf.fit(X, y)
Out:
  building tree 1 of 100
  ...

This progress information is fairly useful for estimating the total time.

Then there are other models like SVMs that print the number of optimization iterations completed, but do not directly report the remaining time.

from sklearn import svm
clf = svm.SVC(verbose=2)
clf.fit(X, y)
Out:
   *
    optimization finished, #iter = 1
    obj = -1.802585, rho = 0.000000
    nSV = 2, nBSV = 2
    ...

As far as I know, linear models don't provide such diagnostic information.

See this thread for more on what the verbosity levels mean: scikit-learn fit remaining time

If you are using IPython, consider using the built-in magic commands %time and %timeit:

%time - Time execution of a Python statement or expression. The CPU and wall clock times are printed, and the value of the expression (if any) is returned. Note that under Win32, system time is always reported as 0, since it can not be measured.

%timeit - Time execution of a Python statement or expression using the timeit module.

Example:

In [4]: %timeit NMF(n_components=16, tol=1e-2).fit(X)
1 loops, best of 3: 1.7 s per loop
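
Outside IPython, a similar best-of-N measurement can be taken with the standard-library timeit module. The following is only a minimal sketch, assuming a scikit-learn-style estimator with a fit(X, y) method; best_fit_seconds and make_model are hypothetical names used for illustration and are not part of scikit-learn or IPython.

```python
import timeit


def best_fit_seconds(make_model, X, y, repeat=3):
    """Return the fastest of `repeat` wall-clock timings of one fit call.

    make_model is a hypothetical zero-argument callable that returns a
    fresh, unfitted estimator, so every run starts from scratch.
    Constructing the estimator is included in the timing, which is
    normally negligible compared to the fit itself.
    """
    runs = timeit.repeat(
        lambda: make_model().fit(X, y),  # one full fit per measurement
        number=1,
        repeat=repeat,
    )
    return min(runs)  # like %timeit, report the best run
```

It could be called as best_fit_seconds(lambda: NMF(n_components=16, tol=1e-2), X, None) for the example above, or with any classifier and its labels. This measures a fit after the fact, so it is mainly useful on a subsample of the data.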

References:

https://ipython.readthedocs.io/en/stable/interactive/magics.html

http://scikit-learn.org/stable/developers/performance.html

We're actually working on a package, scitime, that gives runtime estimates of scikit-learn fits.

You run it right before calling algo.fit(X, y) to get an estimate of the runtime.

Here's a simple use case:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from scitime import Estimator

estimator = Estimator()
rf = RandomForestRegressor()
X, y = np.random.rand(100000, 10), np.random.rand(100000, 1)
# Run the estimation
estimation, lower_bound, upper_bound = estimator.time(rf, X, y)

Feel free to take a look!

Comments
  • Look at time complexity of the algorithm and see for a smaller sample how much time it takes?
  • Thanks for the suggestion. I tried doing this, but it seems that some algorithms scale up somewhat linearly as the data grows, and some scale more exponentially. This is a good suggestion, and certainly better than nothing, but I'm wondering if there's an easier or more automated way than guess-and-check.
  • Thank you, this is very helpful! I saw verbosity, but didn't connect that it reported time remaining.
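
The small-sample suggestion in the comments above can be partly automated. This is a rough sketch, not an established method: it times fits on growing subsamples and assumes the runtime follows a power law, seconds ~ c * n**k, which will underestimate algorithms that really do scale exponentially. time_fit, extrapolate_seconds and make_model are hypothetical names introduced here for illustration.

```python
import math
import time


def time_fit(make_model, X, y, sizes):
    """Time a fit on the first n rows for each n in sizes.

    Assumes the rows are already shuffled, so that every subsample
    contains all classes. Returns a list of (n, seconds) pairs.
    """
    timings = []
    for n in sizes:
        model = make_model()  # fresh, unfitted estimator each round
        start = time.perf_counter()
        model.fit(X[:n], y[:n])
        timings.append((n, time.perf_counter() - start))
    return timings


def extrapolate_seconds(timings, n_target):
    """Fit seconds ~ c * n**k in log-log space and predict at n_target.

    Needs at least two distinct sizes with non-zero timings.
    Returns (estimated_seconds, k), where k is the fitted exponent.
    """
    xs = [math.log(n) for n, _ in timings]
    ys = [math.log(t) for _, t in timings]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    k = sxy / sxx                 # slope of the log-log line
    log_c = y_mean - k * x_mean   # intercept of the log-log line
    return math.exp(log_c + k * math.log(n_target)), k
```

With scikit-learn this might look like timings = time_fit(lambda: svm.SVC(), X, y, [500, 1000, 2000]) followed by extrapolate_seconds(timings, len(X)). An exponent k near 1 suggests roughly linear scaling, while a value near 2 or above is a warning that the full fit may take far longer than the subsamples suggest.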