## Using statsmodel estimations with scikit-learn cross validation, is it possible?

I posted this question to the Cross Validated forum and later realized it might find a more appropriate audience on Stack Overflow instead.

I am looking for a way to feed the `fit` object (result) obtained from python statsmodels into `cross_val_score` of scikit-learn's cross-validation method.
The attached link suggests that it may be possible but I have not succeeded.

I am getting the following error:

```
estimator should a be an estimator implementing 'fit' method statsmodels.discrete.discrete_model.BinaryResultsWrapper object at 0x7fa6e801c590 was passed
```

Refer to this link.


Following the suggestion of David (which gave me an error complaining about a missing function `get_parameters`) and the scikit-learn documentation, I created the following wrapper for a linear regression. It has the same interface as `sklearn.linear_model.LinearRegression`, but in addition it also has the function `summary()`, which gives info about p-values, R² and other statistics, as in `statsmodels.OLS`.

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_X_y, check_is_fitted, check_array


class MyLinearRegression(BaseEstimator, RegressorMixin):
    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept

    def fit(self, X, y, column_names=()):
        """
        Parameters
        ----------
        column_names : list, optional
            The name of the feature to associate to each column of X.
            This is useful if you use the method summary(), so that it
            can show the feature name for each coefficient.
        """
        if self.fit_intercept:
            X = sm.add_constant(X)

        # Check that X and y have correct shape
        X, y = check_X_y(X, y)
        self.X_ = X
        self.y_ = y

        if len(column_names) != 0:
            cols = list(column_names)
            cols.insert(0, 'intercept')
            X = pd.DataFrame(X, columns=cols)

        self.model_ = sm.OLS(y, X)
        self.results_ = self.model_.fit()
        return self

    def predict(self, X):
        # Check that fit has been called
        check_is_fitted(self, 'model_')
        # Input validation
        X = check_array(X)
        if self.fit_intercept:
            X = sm.add_constant(X)
        return self.results_.predict(X)

    def get_params(self, deep=False):
        return {'fit_intercept': self.fit_intercept}

    def summary(self):
        print(self.results_.summary())
```

Example of use:

```python
cols = ['feature1', 'feature2']
X_train = df_train[cols].values
X_test = df_test[cols].values
y_train = df_train['label']
y_test = df_test['label']

model = MyLinearRegression()
model.fit(X_train, y_train)
model.summary()
model.predict(X_test)
```

If you want to show the names of the columns, you can call

```python
model.fit(X_train, y_train, column_names=cols)
```

To use it in cross-validation:

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(MyLinearRegression(), X_train, y_train,
                         cv=10, scoring='neg_mean_squared_error')
scores
```
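Note that `neg_mean_squared_error` returns *negated* MSE values (scikit-learn's convention is that higher scores are better), so you negate them to recover the error. A small self-contained sketch on synthetic data, using `LinearRegression` as a stand-in for the wrapper:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data; with the wrapper you would pass MyLinearRegression() instead
X, y = make_regression(n_samples=100, n_features=2, noise=1.0, random_state=0)

scores = cross_val_score(LinearRegression(), X, y, cv=10,
                         scoring='neg_mean_squared_error')
rmse = np.sqrt(-scores)  # negate the scores, then take the square root
print(rmse.mean(), rmse.std())
```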


For reference, if you use the `statsmodels` formula API and/or the `fit_regularized` method, you can modify @David Dale's wrapper class in this way.

```python
import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin
from statsmodels.formula.api import glm as glm_sm


# This is an example wrapper for statsmodels GLM
class SMWrapper(BaseEstimator, RegressorMixin):
    def __init__(self, family, formula, alpha, L1_wt):
        self.family = family
        self.formula = formula
        self.alpha = alpha
        self.L1_wt = L1_wt
        self.model = None
        self.result = None

    def fit(self, X, y):
        # The formula API needs features and label in a single DataFrame
        data = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
        data.columns = X.columns.tolist() + ['y']
        self.model = glm_sm(self.formula, data, family=self.family)
        self.result = self.model.fit_regularized(alpha=self.alpha,
                                                 L1_wt=self.L1_wt,
                                                 refit=True)
        # Return the estimator itself, not the results wrapper,
        # as scikit-learn requires
        return self

    def predict(self, X):
        return self.result.predict(X)
```


Though it is not technically scikit-learn, there is the package pmdarima, which wraps statsmodels and provides a scikit-learn-like interface.
