Specifying multiple column names with the same prefix efficiently


I am running a regression with observations at the company level. I want to control for the type of company (what it produces). This information is stored in an object column, which I convert to categorical and then expand into dummies.

df['Product Type'] = df['Product Type'].astype('category')
df = pd.get_dummies(df, columns=['Product Type'])

My sample is quite large and I end up getting a lot of dummy variables. It is quite a lot of work to introduce them into my model one by one (there might be 10-15 of them).

reg = sm.OLS(endog=df['Y'], exog= df[['X1', 'Number of workers', 'X2', "Product Type_Jewellery", "Product_Type_Apparel", (all the other product dummies) ]], missing='drop')

Is there a more efficient way to do this? In Stata, I used the prefix i.Product_Type, which signals to the software that the string variable should be treated as categorical. Is there anything similar?


Use str.contains to find the columns whose names contain "Product"; selecting them then becomes easy.

c = df.columns[df.columns.str.contains('Product')]

If regex is not needed, you can initialise c as

c = df.columns[df.columns.str.contains('Product', regex=False)]

Or, using str.startswith:

c = df.columns[df.columns.str.startswith('Product')]

Or, a list comprehension:

c = [c_ for c_ in df if c_.startswith('Product')]

Finally, access the subset by unpacking c:

subset = df[['X1', 'Number of workers', 'X2', *c]]
reg = sm.OLS(endog=df['Y'], exog=subset, missing='drop')
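The steps above can be put together in a minimal end-to-end sketch. The sample data here is hypothetical (only the column names come from the question), and passing dtype=float is an assumption on my part: newer pandas versions return boolean dummies by default, which may not mix well with float regressors when handed to a model.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names mirror the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Y": rng.integers(1, 100, 50),
    "X1": rng.normal(1, 20, 50),
    "X2": rng.normal(-10, 1, 50),
    "Number of workers": np.arange(1, 51) / 10,
    "Product Type": np.tile(["Jewellery", "Apparel", "Toys", "Food", "Tools"], 10),
})

# dtype=float keeps the dummies numeric (newer pandas defaults to bool).
df = pd.get_dummies(df, columns=["Product Type"], dtype=float)

# Collect every dummy column by its shared prefix.
dummy_cols = [col for col in df.columns if col.startswith("Product Type_")]

# Regressors: the named controls plus all dummies, without typing them out.
subset = df[["X1", "Number of workers", "X2", *dummy_cols]]
print(subset.columns.tolist())
```

The resulting subset can then be passed as exog to sm.OLS exactly as in the snippet above.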



The same idea as in cold's answer, but using filter:

sm.OLS(endog=df['Y'], 
       exog=df.filter(regex=r'X1|X2|Number|Product'), 
       missing='drop')
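One caveat with the filter approach: the regex is searched anywhere in the column name, so a loose pattern can pick up columns you did not intend. Here is a small sketch of the difference, using hypothetical extra columns ("X10" and "Other Product") that are not in the original data:

```python
import pandas as pd

# Empty frame with hypothetical column names, just to inspect what gets selected.
df = pd.DataFrame(columns=["Y", "X1", "X10", "X2", "Number of workers",
                           "Product Type_a", "Other Product"])

# Unanchored pattern: matches any column whose name contains one of the alternatives.
loose = df.filter(regex=r"X1|X2|Number|Product").columns.tolist()

# Anchored pattern: exact names for the controls, prefix match for the dummies.
strict = df.filter(regex=r"^(X1|X2|Number of workers)$|^Product Type_").columns.tolist()

print(loose)
print(strict)
```

If your column names are as distinct as in the question, the loose pattern is fine; anchoring only matters once names start to overlap.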



With statsmodels.formula.api you don't need to generate the dummies yourself. Remove the spaces from your column names and reference the categorical column with C(col_name).

import statsmodels.formula.api as smf

df = df.rename(columns={'Product Type': 'Product_Type',
                        'Number of workers': 'Number_of_workers'})

results = smf.ols(formula = 'Y ~ X1 + X2 + Number_of_workers + C(Product_Type)', 
                  data=df, missing='drop').fit()

Sample Data

import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'Y': np.random.randint(1,100,200),
                   'X1': np.random.normal(1,20,200),
                   'X2': np.random.normal(-10,1,200),
                   'Number of workers': np.arange(1,201,1)/10,
                   'Product Type': np.random.choice(list('abcde'), 200)})

Output of results.summary()

========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept               69.2836     23.105      2.999      0.003      23.711     114.856
C(Product_Type)[T.b]    11.3334      6.941      1.633      0.104      -2.356      25.023
C(Product_Type)[T.c]     1.3745      6.943      0.198      0.843     -12.321      15.070
C(Product_Type)[T.d]     2.0430      6.258      0.326      0.744     -10.300      14.386
C(Product_Type)[T.e]     3.8445      6.273      0.613      0.541      -8.528      16.217
X1                       0.0207      0.113      0.184      0.854      -0.202       0.243
X2                       1.4677      2.177      0.674      0.501      -2.825       5.761
Number_of_workers       -0.5803      0.369     -1.573      0.117      -1.308       0.147
==============================================================================

Note that because the product categories form a complete basis and the model includes an intercept, the formula API automatically drops one category (here a, the reference level), similar to what Stata does.
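If you build the dummies yourself with pd.get_dummies and also include a constant, you would probably want the same behaviour. A minimal sketch, assuming the first category should serve as the reference level (the toy data is hypothetical):

```python
import pandas as pd

# Hypothetical toy column with five categories, as in the sample data.
df = pd.DataFrame({"Product Type": list("abcde") * 2})

# All five dummies: each row sums to 1, so they are collinear with an intercept.
full = pd.get_dummies(df, columns=["Product Type"], dtype=float)

# drop_first=True omits the first category ("a"), which becomes the reference level.
reduced = pd.get_dummies(df, columns=["Product Type"], drop_first=True, dtype=float)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

The remaining columns (b through e) line up with the C(Product_Type)[T.b] to [T.e] terms in the summary above.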
