Getting error while trying to read csv using pandas Python due to extra column values

Here is the scenario I am trying to get rid of: I am trying to read the following kind of CSV:

para1,para2,para3,para4
1,2,3,4,
1,2,3,4,5,
1,2,3,4,
2,3,4,5,6,7,8,9,0,

I am using the following command and getting the following error:

>>> import pandas as pd
>>> df =pd.read_csv("test.csv")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python35\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python35\lib\site-packages\pandas\io\parsers.py", line 435, in _read
    data = parser.read(nrows)
  File "C:\Python35\lib\site-packages\pandas\io\parsers.py", line 1139, in read
    ret = self._engine.read(nrows)
  File "C:\Python35\lib\site-packages\pandas\io\parsers.py", line 1995, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 955, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 2172, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 5

I tried to search for the issue and got this thread on SO: Python Pandas Error tokenizing data

So I tried that, but it is not what I was expecting: it truncates the rows that have extra values.

>>> df =pd.read_csv("test.csv",error_bad_lines=False)
b'Skipping line 3: expected 4 fields, saw 5\nSkipping line 5: expected 4 fields, saw 9\n'
>>> df
   para1  para2  para3  para4
0      1      2      3      4
1      1      2      3      4

What I want is something like this: if there are extra values, add as many columns as the widest row requires, naming the extra columns with integers, and fill the remaining values with zero (0) up to the last column while reading the CSV.

The output I am expecting is something like this:

>>> df =pd.read_csv("test.csv")
>>> df
   para1  para2  para3  para4    0    1    2    3    4
0      1      2      3      4  NaN  NaN  NaN  NaN  NaN
1      1      2      3      4  5.0  NaN  NaN  NaN  NaN
2      1      2      3      4  NaN  NaN  NaN  NaN  NaN
3      2      3      4      5  6.0  7.0  8.0  9.0  0.0
>>> df = df.fillna(0)
>>> df
   para1  para2  para3  para4    0    1    2    3    4
0      1      2      3      4  0.0  0.0  0.0  0.0  0.0
1      1      2      3      4  5.0  0.0  0.0  0.0  0.0
2      1      2      3      4  0.0  0.0  0.0  0.0  0.0
3      2      3      4      5  6.0  7.0  8.0  9.0  0.0

But please note: I do not want to specify the columns myself. The program must detect the extra columns automatically and create the column headers as shown above.

Second, please avoid suggesting that I write the header myself. There can be any number of columns, so I may not be able to write the header and would rather leave it as it is; the missing column headers should be the integers as stated above. Does someone have a solution for this? Please let me know.

Try the code below. It uses sep='  ' so each whole line is read as a single field, then takes that first column with iloc, splits it with str.split(..., expand=True) into a new DataFrame, replaces the NaNs with fillna, and finally names the columns by combining the original header with a list comprehension over range(...).

So you should use:

import pandas as pd

# Two-space separator never matches, so each whole line becomes one field
# (multi-character separators require the python engine)
df = pd.read_csv("test.csv", sep='  ', engine='python')
# Strip the trailing comma, then split the single column into real columns
df2 = df.iloc[:, 0].str.replace(',$', '', regex=True).str.split(',', expand=True).fillna(0)
dd = df.columns[0].split(',')                              # original header names
ff = [str(x) for x in range(len(df2.columns) - len(dd))]   # '0', '1', ... for extras
df2.columns = dd + ff
print(df2)

I'm not sure if there is a cleaner way to do this, but I tested it out and it works using just pandas:

import pandas as pd

# Read each line as a single field, then split on commas ourselves
# (sep='\n' is accepted by the pandas versions current at the time of the question)
df = pd.read_csv('test.csv', header=None, sep='\n')
df = df[0].str.split(',', expand=True)
# Promote the first row to header, keeping integer names for the extra columns
new_header = df.iloc[0].fillna(df.columns.to_series())
df = df[1:]
df.columns = new_header
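A minimal self-contained sketch of the same idea, with the question's sample data inlined via a string for illustration. It also converts the values to numbers and fills the gaps with zero, matching the asker's expected fillna(0) output:

```python
import pandas as pd

# Inline copy of the question's test.csv (for illustration only)
raw = """para1,para2,para3,para4
1,2,3,4,
1,2,3,4,5,
1,2,3,4,
2,3,4,5,6,7,8,9,0,
"""

# Split every line on commas, dropping the trailing empty field
rows = [line.rstrip(',').split(',') for line in raw.strip().splitlines()]
header, body = rows[0], rows[1:]

# The widest data row determines the total column count
width = max(len(r) for r in body)
# Extra columns are named '0', '1', ... as the asker requested
columns = header + [str(i) for i in range(width - len(header))]

# Pad short rows with 0 and build the frame, then convert strings to numbers
df = pd.DataFrame([r + [0] * (width - len(r)) for r in body], columns=columns)
df = df.apply(pd.to_numeric)
print(df)
```

For a real file, the `raw` string would be replaced by reading `test.csv`; the padding logic is the same.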

Ok, that means that you will have to parse the file until its end to get the actual number of columns, because pandas.read_csv has no provision for that requirement.

If high performance is not a concern (*), a simple way is to rely on the good old csv module and dynamically add columns as needed:

import csv

import numpy as np
import pandas as pd

with open('test.csv') as fd:
    rd = csv.reader(fd)
    header = next(rd)     # initialize column names from the first row
    next_key = 0          # additional columns will be named '0', '1', ...
    data = {k: list() for k in header}  # one list of values per column
    for row in rd:
        while len(row) > len(header):    # add new columns as needed
            header.append(str(next_key))
            data[header[-1]] = [np.nan] * len(data[header[0]])
            next_key += 1                # advance the next column name
        # pad the row up to the header size
        row.extend([np.nan] * (len(header) - len(row)))
        # and append the values to the column lists
        for i, k in enumerate(header):
            data[k].append(row[i])

# data is now a dict of columns, suitable to feed DataFrame
df = pd.DataFrame(data)

(*) The code above will not be very efficient because it appends elements to the lists one at a time. That would be terrible with a pandas DataFrame, and it is not great even for Python lists. It could be improved by pre-allocating chunks in a numpy.ndarray, at the price of increased complexity.
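For completeness, the "parse to the end first" idea can also be written as a two-pass sketch: one cheap scan to find the widest row, then a normal read_csv call given enough column names that no row has "extra" fields. The sample data is inlined here for illustration; note that the trailing commas produce an all-empty last column, which is dropped afterwards:

```python
import csv
import io

import pandas as pd

# Inline copy of the question's test.csv (for illustration only)
raw = """para1,para2,para3,para4
1,2,3,4,
1,2,3,4,5,
1,2,3,4,
2,3,4,5,6,7,8,9,0,
"""

# Pass 1: find the widest row (a trailing comma counts as an empty field)
rows = list(csv.reader(io.StringIO(raw)))
header = rows[0]
width = max(len(r) for r in rows[1:])

# Pass 2: re-read with enough names for the widest row, so the parser
# never sees "extra" fields; then drop all-empty columns and fill with 0
names = header + [str(i) for i in range(width - len(header))]
df = (pd.read_csv(io.StringIO(raw), skiprows=1, names=names)
        .dropna(axis=1, how='all')
        .fillna(0))
print(df)
```

Unlike the row-by-row loop above, the heavy lifting here stays inside read_csv; the first pass only counts fields.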

Comments
  • Not an efficient solution, but it would work: you can read the csv file without the header first, then read only the headers in a second pass.
  • @AkshayNevrekar Maybe, but is there no solution available with pandas?
  • The above is a pandas solution... maybe you meant to ask if it can be done by tweaking some parameters of read_csv, which at least I am not aware of.
  • @AkshayNevrekar Yeah, something of that sort.
  • I have edited the sample as I forgot to add the extra commas there
  • How can I identify whether to add list(range(5)) or list(range(10)) or anything else? If I have a large dataset I cannot even think of opening it to count how many to add. Can you help me with that?
  • @JafferWilson Edited mine again
  • @JafferWilson Happy to help again :-)
  • This assumes 1,"2,2,2",3 is 5 columns, not 3 as a real CSV parser would see it.
  • I agree it won't work if your data is structured like that and would need a better solution, but it works for the problem data the OP posted.
  • Maybe this will help. But what do you mean by "If high performance is not a concern (*)"? Obviously, performance is a concern.
  • @JafferWilson: What I mean is that this code does not pretend to be as efficient as the C parser of read_csv. On a fairly recent computer, it should be able to process csv files up to several Mbytes. If you intend to process many files of several Gbytes, it would deserve low level optimization.
  • @JafferWilson: ...Said differently, my advice is that you should give it a try, and only if performance is not acceptable then try to use pre_allocation to speedup things.
  • Sure I will give it a try