Using statsmodel estimations with scikit-learn cross validation, is it possible?

I posted this question to the Cross Validated forum and later realized it might find a more appropriate audience on Stack Overflow instead.

I am looking for a way to feed the fit object (result) obtained from the Python statsmodels package into the cross_val_score function of scikit-learn's cross-validation machinery. The attached link suggests that it may be possible, but I have not succeeded.

I am getting the following error:

estimator should be an estimator implementing 'fit' method, statsmodels.discrete.discrete_model.BinaryResultsWrapper object at 0x7fa6e801c590 was passed

Refer to this link.
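Roughly, the call that triggers the error looks like the sketch below (simplified, with toy data and an illustrative Logit model; the point is that a fitted statsmodels results wrapper is passed where scikit-learn expects an unfitted estimator object):

import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import cross_val_score

# Toy data, just to reproduce the error
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = (rng.random(100) > 0.5).astype(int)

# Fitting a statsmodels logistic regression returns a BinaryResultsWrapper
result = sm.Logit(y, X).fit()

# This raises the error above: cross_val_score expects an (unfitted) estimator
# that implements fit(), not a statsmodels results object
scores = cross_val_score(result, X, y, cv=5)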




Following the suggestion of David (which gave me an error complaining about a missing get_parameters function) and the scikit-learn documentation, I created the following wrapper for a linear regression. It has the same interface as sklearn.linear_model.LinearRegression, but in addition it also has a summary() method, which reports p-values, R² and other statistics, as in statsmodels.OLS.

import statsmodels.api as sm
from sklearn.base import BaseEstimator, RegressorMixin
import pandas as pd

from sklearn.utils.validation import check_X_y, check_is_fitted, check_array


class MyLinearRegression(BaseEstimator, RegressorMixin):
    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept

    def fit(self, X, y, column_names=()):
        """
        Parameters
        ----------
        column_names : list, optional
            Names of the features associated with the columns of X.
            Useful when you call summary(), so that each coefficient
            is shown with its feature name.
        """
        # Check that X and y have the correct shape
        X, y = check_X_y(X, y)

        if self.fit_intercept:
            X = sm.add_constant(X)

        self.X_ = X
        self.y_ = y

        if len(column_names) != 0:
            cols = list(column_names)
            if self.fit_intercept:
                cols.insert(0, 'intercept')
            X = pd.DataFrame(X, columns=cols)

        self.model_ = sm.OLS(y, X)
        self.results_ = self.model_.fit()
        return self

    def predict(self, X):
        # Check that fit has been called
        check_is_fitted(self, 'results_')

        # Input validation
        X = check_array(X)

        if self.fit_intercept:
            X = sm.add_constant(X)
        return self.results_.predict(X)

    def get_params(self, deep=True):
        return {'fit_intercept': self.fit_intercept}

    def summary(self):
        print(self.results_.summary())

Example of use:

cols = ['feature1','feature2']
X_train = df_train[cols].values
X_test = df_test[cols].values
y_train = df_train['label']
y_test = df_test['label']
model = MyLinearRegression()
model.fit(X_train, y_train)
model.summary()
model.predict(X_test)

If you want to show the names of the columns, you can call

model.fit(X_train, y_train, column_names=cols)

To use it in cross-validation:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(MyLinearRegression(), X_train, y_train, cv=10, scoring='neg_mean_squared_error')
scores



For reference, if you use the statsmodels formula API and/or the fit_regularized method, you can modify @David Dale's wrapper class in this way.

import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin
from statsmodels.formula.api import glm as glm_sm

# This is an example wrapper for a statsmodels GLM fitted via the formula API
class SMWrapper(BaseEstimator, RegressorMixin):
    def __init__(self, family, formula, alpha, L1_wt):
        self.family = family
        self.formula = formula
        self.alpha = alpha
        self.L1_wt = L1_wt
    def fit(self, X, y):
        # X must be a DataFrame whose column names match the terms in the
        # formula; the response is exposed to the formula under the name 'y'
        data = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
        data.columns = X.columns.tolist() + ['y']
        self.model_ = glm_sm(self.formula, data, family=self.family)
        self.result_ = self.model_.fit_regularized(alpha=self.alpha, L1_wt=self.L1_wt, refit=True)
        # Return self (not the results object) so the wrapper works with
        # clone(), pipelines and cross_val_score
        return self
    def predict(self, X):
        return self.result_.predict(X)
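Here is a sketch of how this wrapper could be used with cross_val_score; the toy data, formula, family and penalty settings below are only illustrative assumptions:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import cross_val_score

# Toy data; the formula refers to the DataFrame column names
rng = np.random.default_rng(0)
X = pd.DataFrame({'x1': rng.normal(size=100), 'x2': rng.normal(size=100)})
y = 1.0 + 2.0 * X['x1'] - 0.5 * X['x2'] + rng.normal(size=100)

wrapper = SMWrapper(family=sm.families.Gaussian(), formula='y ~ x1 + x2',
                    alpha=0.1, L1_wt=1.0)
scores = cross_val_score(wrapper, X, y, cv=5, scoring='neg_mean_squared_error')
print(scores)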



Though I think this is not technically scikit-learn, there is the package pmdarima, which wraps statsmodels and provides a scikit-learn-like interface.
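A minimal sketch of that interface (assuming pmdarima is installed; the toy series and settings are only illustrative):

import numpy as np
import pmdarima as pm

# Toy series with some structure
y = np.sin(np.linspace(0, 20, 200)) + np.random.normal(scale=0.1, size=200)

# auto_arima returns a fitted ARIMA estimator that behaves much like an
# sklearn estimator (fit / predict / get_params)
model = pm.auto_arima(y, seasonal=False, suppress_warnings=True)
forecast = model.predict(n_periods=10)
print(forecast)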
