Get pandas.read_csv to read empty fields as NaN, and empty strings as empty strings
This is quite the opposite of Get pandas.read_csv to read empty values as empty string instead of nan
Given the following CSV file:
    col,val
    "hi
    there",1
    ,2
    \f\,3
    "",4
    """hi""",5
I want it to be read as:
             col  val
    0  hi\nthere    1
    1        NaN    2
    2        \f\    3
    3               4
    4       "hi"    5
That is, reading the empty field (val 2) as NaN, while keeping the empty string (val 4) as an empty string. Currently pd.read_csv converts both val 2 and val 4 to NaN, or, if I use na_filter=False, both are kept as empty strings.
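A minimal way to reproduce the two behaviours described above, assuming the sample file is held in a string (the name RAW is only for illustration) rather than on disk:

```python
import io

import pandas as pd

# The sample CSV as a string: data row 1 has an empty field,
# data row 3 has a quoted empty string.
RAW = 'col,val\n"hi\nthere",1\n,2\n\\f\\,3\n"",4\n"""hi""",5\n'

# Default settings: NaN detection is on.
default = pd.read_csv(io.StringIO(RAW))

# NaN detection switched off.
unfiltered = pd.read_csv(io.StringIO(RAW), na_filter=False)

# Which rows are NaN in the default read?
print(default["col"].isna().tolist())

# Which rows are '' when na_filter=False?
print((unfiltered["col"] == "").tolist())
```

Both lists flag the same two rows, which is exactly the problem: neither setting tells the empty field apart from the quoted empty string.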
I'm assuming these two representations mean different things in CSV (empty field vs. empty string), so I would expect pandas to be able to distinguish them too.
Is there a way to make pandas distinguish these two cases? Or is my assumption wrong, and the two representations are actually the same? (Please point me to a CSV standard if the latter is the case.)
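For comparison, here is a small check with Python's standard csv module using its default settings. It is only a sketch of how one other reader behaves, not evidence of what any CSV standard requires; RAW is an illustrative name for the sample data.

```python
import csv
import io

RAW = 'col,val\n"hi\nthere",1\n,2\n\\f\\,3\n"",4\n"""hi""",5\n'

# newline="" leaves line endings untouched so the csv module can handle
# the newline embedded in the quoted field itself.
rows = list(csv.reader(io.StringIO(RAW, newline="")))

for row in rows:
    print(row)
```

With these defaults the reader returns a plain empty string for both the empty field and the quoted empty string, so the distinction is lost there as well.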
More information: I got the CSV by exporting a BigQuery table into CSV (with the intended meaning that val 2 is null and val 4 is an empty string), and I want to get the exact same table back. So this is not just a contrived example; it is what BigQuery actually produces when exporting to CSV.
EDIT: further searching reveals a GitHub issue from 4 years ago that discusses a similar point (see this comment, for example), and one of the commenters mentions that there is some coercion (I'm not sure what they refer to, but I understand it as coercion between empty field and empty string). Is this still happening?
Another option would be to disable quoting, which lets you tell apart the fields where an empty string is present from those where nothing is present. The problem in this case is the entries that include newline characters in the text. We need to remove these characters first and merge the lines to create a new data file.
When reading the new data file with quoting off, empty values are NaN and empty strings are two quotes. This dataframe can then be used to set the real NaNs in the original dataframe.
    import numpy as np
    import pandas as pd

    with open('./data.csv') as f:
        lines = f.readlines()

    # merge lines where the comma is missing
    it = iter(lines)
    lines2 = [x if ',' in x else x + next(it) for x in it]

    # replace \n which are not at the end of the line
    lines3 = [l.replace('\n', '') + '\n' for l in lines2]

    # write new file with merged lines
    with open('./data_merged.csv', 'w+') as f:
        f.writelines(lines3)

    # read original data
    df = pd.read_csv('./data.csv', na_filter=False)

    # read merged lines data with quoting off
    df_merged = pd.read_csv('./data_merged.csv', quoting=3)

    # in df_merged dataframe if is NaN it is a real NaN
    # set lines in original df to NaN when in df_merged is NaN
    # (np.nan rather than np.NaN, which was removed in NumPy 2.0)
    df.loc[df_merged.col.isna(), 'col'] = np.nan
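The same idea can be sketched without temporary files. This is an illustrative variant, not the answer's exact method: the function name and the regex used to flatten newlines are assumptions, and it will break if a quoted field contains a comma, since the second read ignores quoting.

```python
import csv
import io
import re

import numpy as np
import pandas as pd

RAW = 'col,val\n"hi\nthere",1\n,2\n\\f\\,3\n"",4\n"""hi""",5\n'


def read_csv_keep_empty_strings(text, column):
    """Hypothetical helper: NaN for empty fields, '' for quoted empty strings."""
    # Normal read with NaN detection off: empty fields and "" both become ''.
    df = pd.read_csv(io.StringIO(text), na_filter=False)

    # Flatten newlines inside quoted fields so each record is one physical line.
    flat = re.sub(r'"[^"]*"', lambda m: m.group().replace("\n", " "), text)

    # Second read with quoting off: only truly empty fields become NaN here.
    raw = pd.read_csv(io.StringIO(flat), quoting=csv.QUOTE_NONE)

    # Copy the "real NaN" positions over to the normally parsed frame.
    df[column] = df[column].astype(object)
    df.loc[raw[column].isna().to_numpy(), column] = np.nan
    return df


df = read_csv_keep_empty_strings(RAW, "col")
print(df)
```

Because the normally parsed frame supplies the values and the unquoted read only supplies the mask, the embedded newline in the first record is preserved.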
pandas.read_csv accepts a quoting argument that controls how quote characters are handled in each field. It takes one of the csv.QUOTE_* constants defined in the csv module. The one to note here is csv.QUOTE_NONE, which instructs the reader to perform no special processing of quote characters: fields in double quotes are read exactly as they appear in the file, quotes included. The default used by pandas is csv.QUOTE_MINIMAL.
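As a quick illustration, the constants are plain integers, which is why quoting=3 and quoting=csv.QUOTE_NONE are interchangeable. The two-row sample below is made up for this sketch and has no embedded newlines.

```python
import csv
import io

import pandas as pd

# The csv module's quoting constants are small integers.
print(csv.QUOTE_MINIMAL, csv.QUOTE_NONE)

# Illustrative data: one empty field, one quoted empty string.
SAMPLE = 'col,val\n,2\n"",4\n'

df = pd.read_csv(io.StringIO(SAMPLE), quoting=csv.QUOTE_NONE)
print(df)
```

With quoting off, the empty field is read as NaN while the quoted empty string survives as the two literal quote characters.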
    In : import csv
    In : import pandas as pd
    In : df = pd.read_csv("test.csv", quoting=csv.QUOTE_NONE)
    In : df
    Out:
            col  val
    0       "hi  NaN
    1    there"  1.0
    2       NaN  2.0
    3       \f\  3.0
    4        ""  4.0
    5  """hi"""  5.0
With no special quoting, null values are parsed as NaN and empty strings with double-quotes are left as they are.
But there is a problem with this approach: if a double-quoted field contains a newline, the parts before and after the newline are treated as separate rows. This is evident in the first record of the CSV file, where "hi\nthere" is split across two rows by pandas. To get around this, I first did some pre-processing with the re module to replace newline characters inside double-quoted strings with whitespace. Then I wrote the result back to the same file and read it with read_csv as above. As I'm not aware of the full format of your data, more regex work may be required. For the given problem, however, I get the desired output.
    In : with open("test.csv", 'r+') as f:
    ...:     data = f.read()
    ...:     import re
    ...:     pattern = re.compile(r'".*?"', re.DOTALL)
    ...:     data = pattern.sub(lambda x: x.group().replace('\n', ' '), data)
    ...:     f.seek(0)
    ...:     f.write(data)
    In : df = pd.read_csv("test.csv", quoting=csv.QUOTE_NONE)
    In : df
    Out:
              col  val
    0  "hi there"    1
    1         NaN    2
    2         \f\    3
    3          ""    4
    4    """hi"""    5
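The result above still carries the literal quote characters. One possible follow-up step, sketched here with a hypothetical unquote helper and an in-memory copy of the data instead of overwriting the file, is to strip the outer quotes and unescape doubled quotes after reading:

```python
import csv
import io
import re

import pandas as pd

RAW = 'col,val\n"hi\nthere",1\n,2\n\\f\\,3\n"",4\n"""hi""",5\n'


def unquote(value):
    """Strip one pair of surrounding quotes and unescape doubled quotes."""
    if isinstance(value, str) and len(value) >= 2 and value[0] == value[-1] == '"':
        return value[1:-1].replace('""', '"')
    return value  # NaN and unquoted strings pass through unchanged


# Same pre-processing as above: newlines inside quotes become spaces.
flat = re.sub(r'".*?"', lambda m: m.group().replace("\n", " "), RAW, flags=re.DOTALL)

df = pd.read_csv(io.StringIO(flat), quoting=csv.QUOTE_NONE)
df["col"] = df["col"].map(unquote)
print(df)
```

Note that this keeps the limitation of the pre-processing step: the newline inside the first field has been replaced by a space and is not restored.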
Here's a bit ugly but complete answer:
    import io
    import re
    import pandas as pd

    with open('overflow.csv', 'r') as f:
        with io.StringIO(re.sub(r'(^"",)', "EMPTY_STR,", f.read(), flags=re.MULTILINE)) as ff:
            with io.StringIO(re.sub(r'(,"",)', ",EMPTY_STR,", ff.read(), flags=re.MULTILINE)) as fff:
                with io.StringIO(re.sub(r'(,""$)', ",EMPTY_STR", fff.read(), flags=re.MULTILINE)) as ffff:
                    df = pd.read_csv(ffff)

    df = df.replace('EMPTY_STR', '')
re.sub() replaces each quoted empty string with EMPTY_STR, which is later replaced back with an actual empty string. It has to be called three times to cover the three possible positions (beginning, middle and end of the line).
Truly empty cells are left alone and indeed interpreted as NaN.
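The three passes can probably be folded into one by using lookarounds instead of consuming the neighbouring commas. The pattern below is a sketch under that assumption, not the answer's own code; like the original it can misfire when a quoted field happens to contain ,"", and it assumes EMPTY_STR never occurs as real data.

```python
import io
import re

import pandas as pd

RAW = 'col,val\n"hi\nthere",1\n,2\n\\f\\,3\n"",4\n"""hi""",5\n'
SENTINEL = "EMPTY_STR"

# Match "" only when it is a whole field: preceded by start of text, a comma
# or a line break, and followed by a comma, a line break or end of text.
EMPTY_QUOTED = re.compile(r'(?<![^,\r\n])""(?![^,\r\n])')

marked = EMPTY_QUOTED.sub(SENTINEL, RAW)

# Regular read: truly empty fields become NaN, the sentinel stays a string
# and is turned back into '' afterwards.
df = pd.read_csv(io.StringIO(marked)).replace(SENTINEL, "")
print(df)
```

Because the lookarounds do not consume the separators, adjacent quoted empty fields such as "","" are both replaced in a single pass.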
Is there any way for you to replace the empty strings with something else when creating the BigQuery CSV export? For example, replace "" with "EMPTY_STR"? You could then use a converter function to turn those back into empty strings when using read_csv.