Converting statsmodels summary object to Pandas Dataframe

I am doing multiple linear regression with statsmodels.formula.api (ver 0.9.0) on Windows 10. After fitting the model and calling summary() with the following lines, I get the results as a summary object.

X_opt  = X[:, [0,1,2,3]]
regressor_OLS = sm.OLS(endog= y, exog= X_opt).fit()
regressor_OLS.summary()


                          OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.951
Model:                            OLS   Adj. R-squared:                  0.948
Method:                 Least Squares   F-statistic:                     296.0
Date:                Wed, 08 Aug 2018   Prob (F-statistic):           4.53e-30
Time:                        00:46:48   Log-Likelihood:                -525.39
No. Observations:                  50   AIC:                             1059.
Df Residuals:                      46   BIC:                             1066.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       5.012e+04   6572.353      7.626      0.000    3.69e+04    6.34e+04
x1             0.8057      0.045     17.846      0.000       0.715       0.897
x2            -0.0268      0.051     -0.526      0.602      -0.130       0.076
x3             0.0272      0.016      1.655      0.105      -0.006       0.060
==============================================================================
Omnibus:                       14.838   Durbin-Watson:                   1.282
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               21.442
Skew:                          -0.949   Prob(JB):                     2.21e-05
Kurtosis:                       5.586   Cond. No.                     1.40e+06
==============================================================================

I want to do backward elimination on the p-values at a significance level of 0.05. For this I need to remove the predictor with the highest p-value and run the code again.

I would like to know whether there is a way to extract the p-values from the summary object, so that I can run a loop with a conditional statement and find the significant variables without repeating the steps manually.

Thank you.

The answer from @Michael B works well, but requires "recreating" the table. The table itself is actually directly available from the summary().tables attribute. Each table in this attribute (which is a list of tables) is a SimpleTable, which has methods for outputting different formats. We can then read any of those formats back as a pd.DataFrame:

from io import StringIO

import pandas as pd
import statsmodels.api as sm

model = sm.OLS(y, x)
results = model.fit()
results_summary = results.summary()

# tables is a list; the table at index 1 is the "core" coefficient table.
# read_html returns a list of DataFrames, so we want index 0.
# StringIO is used because newer pandas versions deprecate passing literal HTML strings.
results_as_html = results_summary.tables[1].as_html()
pd.read_html(StringIO(results_as_html), header=0, index_col=0)[0]

Store your model fit as a variable results, like so:

import pandas as pd
import statsmodels.api as sm
model = sm.OLS(y,x)
results = model.fit()

Then create a function like the one below:

def results_summary_to_dataframe(results):
    '''Take a statsmodels results object and return its key statistics as a DataFrame.'''
    pvals = results.pvalues
    coeff = results.params
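    # conf_int() is a DataFrame for pandas inputs but an ndarray for NumPy inputs;
    # wrapping it in pd.DataFrame makes column selection work for both
    conf = pd.DataFrame(results.conf_int())
    conf_lower = conf.iloc[:, 0]
    conf_higher = conf.iloc[:, 1]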

    results_df = pd.DataFrame({"pvals":pvals,
                               "coeff":coeff,
                               "conf_lower":conf_lower,
                               "conf_higher":conf_higher
                                })

    #Reordering...
    results_df = results_df[["coeff","pvals","conf_lower","conf_higher"]]
    return results_df

You can further explore all the attributes of the results object by using dir() to print, then add them to the function and df accordingly.

An easy solution is just one line of code:

LRresult = (result.summary2().tables[1])

This will give you a dataframe object:

type(LRresult)

pandas.core.frame.DataFrame

To get the significant variables and run the test again:

newlist = list(LRresult[LRresult['P>|z|']<=0.05].index)[1:]
myform1 = 'binary_Target' + ' ~ ' + ' + '.join(newlist)

M1_test2 = smf.logit(formula=myform1,data=myM1_1)

result2 = M1_test2.fit(maxiter=200)
LRresult2 = (result2.summary2().tables[1])
LRresult2

You can also simply call summary2() as below. It is an easy fix and works in most cases, since its tables are already pandas DataFrames.

lr.summary2()

If you want the surrounding information, try the following:

import pandas as pd
dfs = {}
fs = fa_model.summary()
for item in fs.tables[0].data:
    dfs[item[0].strip()] = item[1].strip()
    dfs[item[2].strip()] = item[3].strip()
for item in fs.tables[2].data:
    dfs[item[0].strip()] = item[1].strip()
    dfs[item[2].strip()] = item[3].strip()
dfs = pd.Series(dfs)

Comments
  • The accepted answer shows how to convert the summary table to pandas DataFrame. However, for the use case of selection on p-values it is better to directly use the attribute results.pvalues, which is also used in the second answer.
  • This doesn't work for when using formula API. AttributeError: 'OLSResults' object has no attribute 'tables'
  • What version are you on? I'm on python 3.6.5 and using the latest version of statsmodels, but didn't test older versions.
  • Python 3.6.5, statsmodels 0.9.0
  • Woops - forgot the summary method! Thanks for pointing that out. Answer is updated.
  • Why didn't I think of that? Borderline hacky but very neat. Here's an alternative using the csv methods, in case it comes in handy: pd.read_csv(pd.compat.StringIO(table.as_csv()), index_col=0)
  • Thank you Michael B for the help.
  • No problem, if it worked please mark the answer as correct! Happy coding/data sci-ing!!
  • Summary2 is not yet considered stable, though looks close. See this discussion.