Pandas cast all object columns to category

I want an elegant function to cast all object columns in a pandas DataFrame to categories.

df[x] = df[x].astype("category") performs the type cast for a single column, and df.select_dtypes(include=['object']) sub-selects all object columns. However, converting only that selection loses the other columns, so a manual merge is required. Is there a solution that "just works in place" or does not require a manual cast?

edit

I am looking for something similar to http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.convert_objects.html, but for conversion to categorical data.

Use apply and pd.Series.astype with dtype='category'.

Consider the pd.DataFrame df

import pandas as pd

df = pd.DataFrame(dict(
        A=[1, 2, 3, 4],
        B=list('abcd'),
        C=[2, 3, 4, 5],
        D=list('defg')
    ))
df

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
A    4 non-null int64
B    4 non-null object
C    4 non-null int64
D    4 non-null object
dtypes: int64(2), object(2)
memory usage: 200.0+ bytes

Let's use select_dtypes to pick out all 'object' columns and convert them, then recombine them with the result of a second select_dtypes call that excludes them.

df = pd.concat([
        df.select_dtypes([], ['object']),
        df.select_dtypes(['object']).apply(pd.Series.astype, dtype='category')
        ], axis=1).reindex_axis(df.columns, axis=1)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
A    4 non-null int64
B    4 non-null category
C    4 non-null int64
D    4 non-null category
dtypes: category(2), int64(2)
memory usage: 208.0 bytes

I think that this is a more elegant way:

df = pd.DataFrame(dict(
        A=[1, 2, 3, 4],
        B=list('abcd'),
        C=[2, 3, 4, 5],
        D=list('defg')
    ))

df.info()

df.loc[:, df.dtypes == 'object'] =\
    df.select_dtypes(['object'])\
    .apply(lambda x: x.astype('category'))

df.info()
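
As a hedged alternative to the .loc assignment above: in recent pandas versions, assigning through .loc may try to write into the existing object columns rather than replace them, so the dtype can stay object. A minimal sketch that avoids this passes a {column: dtype} mapping to DataFrame.astype, which returns a new DataFrame. The names object_cols and converted are illustrative only, and the string columns are built with dtype=object explicitly so the example does not depend on how a given pandas version infers string dtypes.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": pd.Series(list("abcd"), dtype=object),
    "C": [2, 3, 4, 5],
    "D": pd.Series(list("defg"), dtype=object),
})

# Labels of the object-dtype columns only; numeric columns are not selected.
object_cols = df.select_dtypes(include=["object"]).columns

# astype accepts a mapping of column label -> dtype and returns a new
# DataFrame; columns missing from the mapping keep their dtype and position.
converted = df.astype({col: "category" for col in object_cols})

converted.info()
```

No concat or reindex step is needed here because the column order is never disturbed.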

Wish I could add this as a comment, but can't.

The accepted answer doesn't work for pandas version 0.25 and higher. Use .reindex instead of reindex_axis. See here for more information: https://github.com/scikit-hep/root_pandas/issues/82
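
For illustration, here is a sketch of the accepted answer's concat approach with .reindex substituted for reindex_axis. It assumes the same example frame as above; the object columns are created with dtype=object explicitly, which is an assumption made to keep the example independent of version-specific string inference.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": pd.Series(list("abcd"), dtype=object),
    "C": [2, 3, 4, 5],
    "D": pd.Series(list("defg"), dtype=object),
})

original_columns = df.columns

df = pd.concat(
    [
        # Everything that is not object dtype, untouched.
        df.select_dtypes(exclude=["object"]),
        # Object columns, each converted to category.
        df.select_dtypes(include=["object"]).apply(
            lambda col: col.astype("category")
        ),
    ],
    axis=1,
).reindex(columns=original_columns)  # restore the original column order

df.info()
```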

Often the order of categories has meaning. For example, the T-shirt sizes 'S', 'M', 'L', 'XL' are ordered categories (called ordinals in SPSS). If you are interested in creating ordered categories from strings, you can use this code:

df = pd.concat([
        df.select_dtypes([], ['object']),
        df.select_dtypes(['object']).apply(pd.Categorical, ordered=True)
        ], axis=1).reindex(df.columns, axis=1)

In the resulting DataFrame, categorical columns can be sorted by value the same way you would sort strings.
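
When the meaningful order differs from alphabetical order, as with the T-shirt sizes mentioned above, one option is to spell the order out with pd.CategoricalDtype. This is a small sketch, not part of the original answer; the frame sizes and the column name size are made up for the example.

```python
import pandas as pd

# Hypothetical example data: sizes in arbitrary order.
sizes = pd.DataFrame({"size": pd.Series(["M", "XL", "S", "L"], dtype=object)})

# Declare the categories in their meaningful order and mark them as ordered.
size_dtype = pd.CategoricalDtype(categories=["S", "M", "L", "XL"], ordered=True)
sizes["size"] = sizes["size"].astype(size_dtype)

# Sorting now follows the declared category order, not alphabetical order.
ordered = sizes.sort_values("size")["size"].tolist()
print(ordered)
```

Sorting the same values as plain strings would put 'L' before 'M' and 'S' before 'XL', which is not the size order.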

Comments
  • Indeed, this is a great start. But I only want to convert object dtype columns, not float or integer ones; your solution converts everything to category by brute force.
  • This: df.select_dtypes(include=['object']).apply(pd.Series.astype, dtype='category').info() partially works, i.e. all object columns are converted. But afterwards a manual merge with the numeric columns is needed. How can I avoid this and selectively change the dtypes in place?