Python pandas: how to specify data types when reading an Excel file?

I am importing an Excel file into a pandas DataFrame with the pandas.read_excel() function.

One of the columns is the primary key of the table: it's all numbers, but it's stored as text (the little green triangle in the top left of the Excel cells confirms this).

However, when I import the file into a pandas dataframe, the column gets imported as a float. This means that, for example, '0614' becomes 614.

Is there a way to specify the datatype when importing a column? I understand this is possible when importing CSV files but couldn't find anything in the syntax of read_excel().

The only solution I can think of is to add an arbitrary letter at the beginning of the text in Excel (converting '0614' into 'A0614') to make sure the column is imported as text, and then to chop off the 'A' in Python so I can match it to other tables I am importing from SQL.

You just specify converters. I created an excel spreadsheet of the following structure:

names   ages
bob     05
tom     4
suzy    3

Where the "ages" column is formatted as strings. To load:

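import pandas as pd

# converters maps a column name to a function applied to each cell value.
# Older pandas versions spelled the sheet keyword `sheetname`.
df = pd.read_excel('Book1.xlsx', sheet_name='Sheet1', header=0,
                   converters={'names': str, 'ages': str})
>>> df
       names ages
   0   bob   05
   1   tom   4
   2   suzy  3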

Starting with v0.20.0, read_excel() accepts a dtype keyword argument that sets the data types of the columns, just as read_csv() does.

If converters and dtype are both given for the same column name, the converter takes precedence and the dtype is ignored.

1) To stop pandas from inferring dtypes and instead keep the contents of the columns as they are in the file, set this argument to str or object. (One case where this matters is leading zeros in numbers, which would otherwise be lost.)

pd.read_excel('file_name.xlsx', dtype=str)            # (or) dtype=object

2) It also accepts a dict that maps column names to data types, which is useful when you want to set the dtype for only a subset of the columns.

import numpy as np
import pandas as pd

# Assuming only the data types of columns `a` and `b` need to be set
pd.read_excel('file_name.xlsx', dtype={'a': np.float64, 'b': np.int32})
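
As a rough end-to-end check of the dict form, the sketch below writes a tiny workbook to an in-memory buffer and reads it back. The column names `id` and `amount` and the sample values are made up for illustration, and the sketch assumes the openpyxl engine is installed.

```python
import io

import pandas as pd

# Hypothetical sample data: "id" holds text keys with leading zeros.
source = pd.DataFrame({"id": ["0614", "0007"], "amount": [10.5, 3.0]})

# Write the workbook to memory instead of disk (assumes openpyxl is installed).
buffer = io.BytesIO()
source.to_excel(buffer, index=False, engine="openpyxl")
buffer.seek(0)

# Force only "id" to text; "amount" is still inferred as a number.
df = pd.read_excel(buffer, dtype={"id": str}, engine="openpyxl")

print(df["id"].tolist())
print(df["amount"].sum())
```

Only the column named in the dict is pinned to text here; every other column still goes through the usual type inference.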

The read_excel() function has a converters argument, where you can apply functions to input in certain columns. You can use this to keep them as strings. Documentation:

Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.

Example code:

pandas.read_excel(my_file, converters={my_str_column: str})
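
Since a converter is just a function of one argument, it does not have to be a built-in like str. The sketch below shows a hand-written converter; `to_padded_key`, `load_with_text_key` and the fixed width of 4 are hypothetical names and assumptions chosen for illustration, not part of pandas.

```python
import math

import pandas as pd


def to_padded_key(value, width=4):
    """Turn one Excel cell value into a zero-padded text key (hypothetical helper)."""
    # Leave empty cells alone (they may arrive as None or NaN).
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return value
    # Numeric cells may arrive as floats such as 614.0; drop the ".0" first.
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    return str(value).strip().zfill(width)


def load_with_text_key(path, key_column):
    """Read a workbook, passing every cell of key_column through to_padded_key."""
    return pd.read_excel(path, converters={key_column: to_padded_key})
```

Padding like this only makes sense if every key is known to have the same width. If the cell was stored as a number in Excel, the zeros are not in the file at all, so the converter is rebuilding them rather than preserving them.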

If you do not know the number or names of the columns in advance, this method can be handy:

import pandas as pd

# First read: only used to discover the column names
column_list = list(pd.read_excel(file_name, sheet_name='Sheet1').columns)
# Map every column to str so that no column is coerced to a number
converter = {col: str for col in column_list}
df_actual = pd.read_excel(file_name, sheet_name='Sheet1', converters=converter)

where column_list is the list of your column names.

If you don't know the column names and you want to specify str data type to all columns:

table = pd.read_excel("path_to_filename")
cols = table.columns
conv = dict(zip(cols, [str] * len(cols)))
table = pd.read_excel("path_to_filename", converters=conv)
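
The two-pass idea can be wrapped in a small helper. This is a minimal sketch: `make_str_converters` and `read_excel_as_text` are hypothetical names, and the first pass uses nrows=0 so that only the header row is parsed, in the spirit of the commenter's suggestion to limit the first read.

```python
import pandas as pd


def make_str_converters(columns):
    """Map every column label to str, for use as read_excel's converters."""
    return {col: str for col in columns}


def read_excel_as_text(path, sheet_name=0):
    """Read one sheet with every column converted to str (hypothetical helper)."""
    # nrows=0 reads only the header row, which is enough to get the labels.
    header = pd.read_excel(path, sheet_name=sheet_name, nrows=0)
    converters = make_str_converters(header.columns)
    return pd.read_excel(path, sheet_name=sheet_name, converters=converters)
```

On pandas v0.20.0 or later, passing dtype=str to a single read_excel() call is a shorter way to get a similar result.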


Comments
  • I had understood 'converters' specified a function to apply to the column. evidently I was wrong - thank you for pointing this out, it's very useful!
  • Where can I find the list of allowable converter functions? I see str here, but presumably there's int and a few more besides - is there a link anywhere to the source docs that enumerates the possible converter functions available?
  • I have not found a list either. Since "converters" accepts functions, I suspect that your imagination is the limit, just so you keep within the bounds of the "converters" functionality (i.e. it was designed to use functions that require only one input variable!).
  • Oddly, when I set a column name to str in the converters dict and then print df.dtypes, the type for that column is set to object not str. Any ideas? Is it even important?
  • @mhyousefi This is not important (on the surface at least). When setting column types as strings Pandas refers to them as objects. See HYRY's answer here
  • This should be the accepted answer, as "converters" seems to convert data AFTER it has been read as a different type. This leads to information loss: "001" is read as int("001") = 1 and then converted to the string "1", but "001" != "1". At least that is what happened in my case; correct me if I'm wrong.
  • If we are not aware of number of columns present in the sheet, is there any way to apply it to every column while reading?
  • Got the solution: converters = {col: str for col in column_list} and then df = pd.read_excel('some_excelfile.xls', converters=converters)
  • can you do it by index or do you need the name? e.g, i'm reading my file in without headers.
  • @rrs, you can just use an integer as the key instead of the column name.
  • Just wonder if df = df.astype(str) would not be better (simpler).
  • Why do you create a list first? Maybe more efficient to use: conv = {x:str for x in pd.read_excel(fn,sheet_name='sheet1').columns} and then df = pd.read_excel(fn,sheet_name='sheet1',converters=conv)
  • Also It might be useful to add nrows=1 in the first read_excel call to avoid having to read the whole excel table only to get the headers.
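
On the question of whether df.astype(str) after the read would be simpler: the sketch below suggests it cannot bring back leading zeros once a column has been parsed as numbers. The sample values and the fixed width of 4 are assumptions for illustration.

```python
import pandas as pd

# What the column looks like once '0614' has already been parsed as a number.
parsed = pd.Series([614, 7], name="id")

# astype(str) only stringifies the numbers; the leading zeros are gone.
as_text = parsed.astype(str)
print(as_text.tolist())  # ['614', '7']

# If every key is known to have a fixed width (assumed to be 4 here),
# zero-padding can rebuild them after the fact.
rebuilt = as_text.str.zfill(4)
print(rebuilt.tolist())  # ['0614', '0007']
```

When the width is not fixed, the zeros cannot be reconstructed this way, which is why setting dtype or converters at read time is the safer route.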