Convert columns to string in Pandas

I have the following DataFrame from a SQL query:

(Pdb) pp total_rows
     ColumnID  RespondentCount
0          -1                2
1  3030096843                1
2  3030096845                1

and I want to pivot it like this:

total_data = total_rows.pivot_table(columns=['ColumnID'])

(Pdb) pp total_data
ColumnID         -1            3030096843   3030096845
RespondentCount            2            1            1

[1 rows x 3 columns]


total_rows.pivot_table(columns=['ColumnID']).to_dict('records')[0]

{3030096843: 1, 3030096845: 1, -1: 2}

but I want to make sure the 303... columns are cast as strings instead of integers, so that I get this:

{'3030096843': 1, '3030096845': 1, -1: 2}

One way to convert to string is to use astype:

total_rows['ColumnID'] = total_rows['ColumnID'].astype(str)
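
As a rough sketch of how this plays out with the DataFrame from the question: the data below is re-created by hand, the names as_text, all_str, pivoted and mixed are made up for illustration, and passing aggfunc="sum" is an assumption made here so the counts stay integers.

```python
import pandas as pd

# Hand-built stand-in for the SQL result shown in the question.
total_rows = pd.DataFrame({
    "ColumnID": [-1, 3030096843, 3030096845],
    "RespondentCount": [2, 1, 1],
})

# Option 1: cast the whole column before pivoting.
# astype returns a new Series, so the result is assigned to a new frame.
as_text = total_rows.assign(ColumnID=total_rows["ColumnID"].astype(str))
all_str = as_text.pivot_table(
    values="RespondentCount", columns="ColumnID", aggfunc="sum"
).to_dict("records")[0]

# Option 2: pivot first, then rename only the positive IDs,
# so that -1 stays an integer key.
pivoted = total_rows.pivot_table(
    values="RespondentCount", columns="ColumnID", aggfunc="sum"
)
mixed = pivoted.rename(
    columns=lambda c: str(c) if c > 0 else c
).to_dict("records")[0]
```

Casting the whole column also turns -1 into "-1". Renaming after the pivot is one way to keep -1 as an integer, which is what the desired output in the question shows.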

However, perhaps you are looking for the to_json function, which converts the keys to valid JSON (and therefore to strings):

In [11]: df = pd.DataFrame([['A', 2], ['A', 4], ['B', 6]])

In [12]: df.to_json()
Out[12]: '{"0":{"0":"A","1":"A","2":"B"},"1":{"0":2,"1":4,"2":6}}'

In [13]: df[0].to_json()
Out[13]: '{"0":"A","1":"A","2":"B"}'

Note: you can pass in a buffer/file to save this to, along with some other options...
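
A minimal sketch of both points, assuming a small hand-made frame shaped like the one in the question (the names row, text, decoded and buf are illustrative only). It shows that the decoded keys come back as strings, and that a file-like buffer can be passed as the first argument.

```python
import io
import json

import pandas as pd

df = pd.DataFrame({"ColumnID": [-1, 3030096843], "RespondentCount": [2, 1]})

# Use ColumnID as the index so its values become the JSON object keys.
row = df.set_index("ColumnID")["RespondentCount"]

# With no buffer, to_json returns the JSON text.
text = row.to_json()
decoded = json.loads(text)  # JSON object keys are always strings

# With a buffer (or a file path), the JSON text is written there instead.
buf = io.StringIO()
row.to_json(buf)
```

Note that every key becomes a string here, including -1.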

Note that astype() does not modify the DataFrame in place, so the returned Series needs to be assigned back to the DataFrame column. You can also convert multiple columns to strings at once by putting the column names in square brackets to form a list.

If you need to convert ALL columns to strings, you can simply use:

df = df.astype(str)

This is useful if you need everything except a few columns to be strings/objects, then go back and convert the other ones to whatever you need (integer in this case):

 df[["D", "E"]] = df[["D", "E"]].astype(int) 
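
A small sketch of that round trip, assuming a made-up frame whose "D" and "E" columns hold whole numbers (the column names simply mirror the line above):

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2],
    "B": [1.5, 2.5],
    "D": [10, 20],
    "E": [30, 40],
})

# Step 1: turn every column into strings.
df = df.astype(str)

# Step 2: convert the selected columns back to integers.
df[["D", "E"]] = df[["D", "E"]].astype(int)
```

The second step only works if every value in those columns parses as an integer; a value such as "1.5" or "nan" would make astype(int) raise an error.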

Here's another approach, particularly useful for converting multiple columns to strings instead of just a single column:

In [76]: import numpy as np
In [77]: import pandas as pd
In [78]: df = pd.DataFrame({
    ...:     'A': [20, 30.0, np.nan],
    ...:     'B': ["a45a", "a3", "b1"],
    ...:     'C': [10, 5, np.nan]})
    ...: 

In [79]: df.dtypes ## Current datatype
Out[79]: 
A    float64
B     object
C    float64
dtype: object

## Multiple columns string conversion
In [80]: df[["A", "C"]] = df[["A", "C"]].astype(str) 

In [81]: df.dtypes ## Updated datatype after string conversion
Out[81]: 
A    object
B    object
C    object
dtype: object

To convert all columns to strings, you can first build the list of column names:

all_columns = list(df)  # creates a list of all column headers

The following snippet compares how astype(str) and apply(str) treat missing values:

import pandas as pd
import numpy as np

df = pd.DataFrame([None, 'string', np.nan, 42], index=[0, 1, 2, 3], columns=['A'])
df1 = df['A'].astype(str)
df2 = df['A'].apply(str)

print(df.isnull())
print(df1.isnull())
print(df2.isnull())

pandas >= 1.0: It's time to stop using astype(str)!

Prior to pandas 1.0 (well, 0.25 actually) this was the de facto way of declaring a Series/column as a string:

# pandas <= 0.25
# Note to pedants: specifying the type is unnecessary since pandas will 
# automagically infer the type as object
s = pd.Series(['a', 'b', 'c'], dtype=str)
s.dtype
# dtype('O')

From pandas 1.0 onwards, consider using "string" type instead.

# pandas >= 1.0
s = pd.Series(['a', 'b', 'c'], dtype="string")
s.dtype
# StringDtype

Here's why, as quoted by the docs:

  1. You can accidentally store a mixture of strings and non-strings in an object dtype array. It’s better to have a dedicated dtype.

  2. object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text while excluding non-text but still object-dtype columns.

  3. When reading code, the contents of an object dtype array is less clear than 'string'.

See also the section on Behavioral Differences between "string" and object.
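
One such difference, sketched here under the assumption of pandas 1.0 or later (the variable names are illustrative): string accessor methods that return numbers give a nullable integer result on "string" columns, while on object columns a missing value forces a float result.

```python
import pandas as pd

obj = pd.Series(["a", None, "ccc"], dtype=object)
txt = pd.Series(["a", None, "ccc"], dtype="string")

# On object dtype the missing entry becomes NaN, so the result is float.
obj_len = obj.str.len()

# On "string" dtype the missing entry stays <NA> and the result is
# a nullable integer column.
txt_len = txt.str.len()
```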

Extension types (introduced in 0.24 and formalized in 1.0) are closer to pandas than NumPy, which is good because NumPy types are not powerful enough. For example, NumPy has no way of representing missing values in integer data (since type(NaN) == float), but pandas can, using nullable integer columns.
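
A short illustration of the nullable integer point, as a sketch only (the two Series names are made up):

```python
import numpy as np
import pandas as pd

# With NumPy-backed data, one missing value turns the integers into floats.
as_float = pd.Series([1, 2, np.nan])

# The "Int64" extension dtype (capital I) keeps them as integers
# and stores the missing entry as pd.NA.
as_nullable = pd.Series([1, 2, None], dtype="Int64")
```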


Why should I stop using it?

Accidentally mixing dtypes

The first reason, as outlined in the docs, is that you can accidentally store non-text data in object columns.

# pandas <= 0.25
pd.Series(['a', 'b', 1.23])   # whoops, this should have been "1.23"

0       a
1       b
2    1.23
dtype: object

pd.Series(['a', 'b', 1.23]).tolist()
# ['a', 'b', 1.23]   # oops, pandas was storing this as float all the time.

# pandas >= 1.0
pd.Series(['a', 'b', 1.23], dtype="string")

0       a
1       b
2    1.23
dtype: string

pd.Series(['a', 'b', 1.23], dtype="string").tolist()
# ['a', 'b', '1.23']   # it's a string and we just averted some potentially nasty bugs.

Challenging to differentiate strings and other Python objects

Another obvious example is that it's harder to distinguish between "strings" and "objects". Objects are essentially the blanket type for any type that does not support vectorizable operations.

Consider,

# Setup
df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [{}, [1, 2, 3], 123]})
df
 
   A          B
0  a         {}
1  b  [1, 2, 3]
2  c        123

Up to pandas 0.25, there was virtually no way to tell that "A" and "B" do not hold the same type of data.

# pandas <= 0.25  
df.dtypes

A    object
B    object
dtype: object

df.select_dtypes(object)

   A          B
0  a         {}
1  b  [1, 2, 3]
2  c        123

From pandas 1.0, this becomes a lot simpler:

# pandas >= 1.0
# Convenience function I call to help illustrate my point.
df = df.convert_dtypes()
df.dtypes

A    string
B    object
dtype: object

df.select_dtypes("string")

   A
0  a
1  b
2  c

Readability

This is self-explanatory ;-)


OK, so should I stop using it right now?

...No. As of writing this answer (version 1.1), there are no performance benefits but the docs expect future enhancements to significantly improve performance and reduce memory usage for "string" columns as opposed to objects. With that said, however, it's never too early to form good habits!

Using .apply() with a lambda conversion function also works in this case:

total_rows['ColumnID'] = total_rows['ColumnID'].apply(lambda x: str(x))

For entire DataFrames you can use .applymap(), though .astype() is probably faster in any case. Note that in pandas 2.1 and later, .applymap() is deprecated in favor of DataFrame.map().
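
As a hedged sketch, the element-wise and dtype-based conversions can be compared on a hand-made frame like the one in the question. The whole-frame variant below uses apply with Series.map rather than .applymap(), which is a substitution made here so the example does not depend on the pandas version; the names via_apply, via_astype and whole_frame are illustrative.

```python
import pandas as pd

total_rows = pd.DataFrame({
    "ColumnID": [-1, 3030096843],
    "RespondentCount": [2, 1],
})

# Single column: element-wise str() versus a dtype cast.
via_apply = total_rows["ColumnID"].apply(str)
via_astype = total_rows["ColumnID"].astype(str)

# Whole frame: call str() on every element, one column at a time.
whole_frame = total_rows.apply(lambda col: col.map(str))
```

For plain integer data like this, both single-column approaches produce the same strings.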

Comments
  • From pandas 1.0, the documentation recommends using astype("string") instead of astype(str) for some pretty good reasons, take a look.
  • I think to_string() is preferable due to the preservation of NULLs stackoverflow.com/a/44008334/3647167
  • @Keith null preservation is attractive. but the doc says its purpose is to 'Render a DataFrame to a console-friendly tabular output'. i'd like someone authoritative to weigh in
  • to_json() probably does not call astype(str) as it leaves datetime64 and its subclasses as milliseconds since epoch.
  • @Sussch I suspect that's because json doesn't have an explicit datetime format, so you're kinda forced to use epoch. Which is to say, I think that's the standard.
  • This works if source is a,b,c and fails if source is 1,2,3 etc.
  • @Nages I hope so, it generally doesn't make sense to represent numeric data as text.
  • That is right. But some times like it happens if you are trying to solve Kaggle titanic competition where Pclass is represented as 1,2 and 3. Here it should be categorical like string format instead of numeric. To solve this problem str has helped instead of string in that case. Any way thanks it works for characters. Thanks for sharing this documentation details.