Python: Error tokenizing data. C error: Calling read(nbytes) on source failed with input gzip file


I am using conda Python 2.7:

python --version
Python 2.7.12 :: Anaconda 2.4.1 (x86_64)

I have the following method to read large gzip files:

import os
import pandas as pd

df = pd.read_csv(os.path.join(filePath, fileName),
                 sep='|', compression='gzip', dtype='unicode',
                 error_bad_lines=False)

but when I read the file I get the following error:

pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
Segmentation fault: 11

I have read all the existing answers, but most of those questions involved malformed rows such as extra columns, and I am already handling that with the error_bad_lines=False option.

What are my options here?

I found something interesting when I tried to uncompress the file:

gunzip -k myfile.txt.gz 
gunzip: myfile.txt.gz: unexpected end of file
gunzip: myfile.txt.gz: uncompress failed

I didn't find a pure Python solution, but using Unix tools I managed to work around the problem:

First I used zless myfile.txt.gz > uncompressedMyfile.txt to dump the readable part of the archive, then I used sed to remove the last line, because I could clearly see that the last line was corrupt:

sed -i '' '$d' uncompressedMyfile.txt

(The -i '' makes BSD/macOS sed edit the file in place; GNU sed takes -i alone. Without it, sed only prints the result to stdout and the file stays corrupt.)

Then I gzipped the file again: gzip -k uncompressedMyfile.txt
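For completeness, the same workaround can be done from Python. A minimal sketch (strip_last_line and the file names are illustrative; it buffers the lines in memory, so it assumes the uncompressed data fits in RAM):

import gzip

def strip_last_line(src='myfile.txt.gz', dst='myfile.fixed.txt.gz'):
    # Stream whatever is still readable out of the truncated archive.
    lines = []
    try:
        with gzip.open(src, 'rb') as f:
            for line in f:
                lines.append(line)
    except (EOFError, IOError):
        pass  # the stream ends abruptly at the point of truncation
    # Drop the corrupt final line and write a clean archive.
    with gzip.open(dst, 'wb') as out:
        out.writelines(lines[:-1])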

I was then able to successfully read the file with the following Python code:

import os
import pandas as pd
from pandas.parser import CParserError  # pandas.errors.ParserError in newer versions

def read_gzip_csv(filePath, fileName):
    try:
        df = pd.read_csv(os.path.join(filePath, fileName), sep='|',
                         compression='gzip', dtype='unicode', error_bad_lines=False)
    except CParserError:
        print "Something is wrong with the file"
        df = None
    return df

read_csv C-engine CParserError: Error tokenizing data (GitHub issue), C error: Buffer overflow caught - possible malformed input file. If you read the same CSV with the Python engine, no exception is thrown: pd.read_csv('faulty_row.csv', encoding='utf8', engine='python')

The input gzip file is corrupted. Get a proper copy of the file from the source, or try gzip repair tools, before you pass it along to pandas.
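A quick way to confirm this kind of corruption from Python is to read the archive to the end and see whether it errors. A sketch, with gzip_is_intact being a made-up name (on Python 2 a truncated stream typically surfaces as EOFError or IOError):

import gzip

def gzip_is_intact(path):
    # Reading the archive to the end forces the CRC/length check;
    # a truncated or corrupt stream raises before we get there.
    try:
        with gzip.open(path, 'rb') as f:
            while f.read(1024 * 1024):  # 1 MB chunks
                pass
        return True
    except (EOFError, IOError):
        return False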

ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed, another report: a 16 GB CSV that used to read fine began failing with this error. The file had been produced via df = pd.read_pickle('faulty_row.pkl') followed by df.to_csv('faulty_row.csv', encoding='utf8', index=False), then read back with pd.read_csv('faulty_row.csv', encoding='utf8').

Sometimes the error shows up if you already have the file open in another program. Try closing the file and re-running.

error reading test set, a similar case: test = pd.read_table("test.tsv") failed with "ParserError: Error tokenizing data. C error: EOF inside string starting at line 145998", so some rows were not read from the test file. Inspecting the CSV showed stray characters such as \r that led to unexpected escapes; replacing them in the source file did the trick.
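A sketch of that cleanup step, assuming stray carriage returns are the only problem and that real line endings are \n (file names are illustrative):

import pandas as pd

# Scrub stray carriage returns that trip the C tokenizer
# ("EOF inside string"), then parse the cleaned copy.
with open('test.tsv', 'rb') as f:
    cleaned = f.read().replace(b'\r', b'')
with open('test_clean.tsv', 'wb') as f:
    f.write(cleaned)
df = pd.read_table('test_clean.tsv')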

Error tokenizing data. C error: Calling read(nbytes) on source failed, same symptom when opening a 500 MB CSV (total input about 1.5 GB across more than 60K files). The solution there was to pass engine='python' in the read_csv call: the pandas CSV parser can use two different engines, the default fast C engine and a slower but more tolerant Python engine.
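One way to package that advice is a wrapper that tries the C engine first and falls back, as sketched here (robust_read is a made-up name; the exception class lives at pandas.parser.CParserError in 0.17.x and pandas.errors.ParserError in later versions):

import pandas as pd
from pandas.parser import CParserError  # pandas.errors.ParserError later

def robust_read(path, **kwargs):
    # Try the fast C engine first; fall back to the slower but more
    # tolerant Python engine when the C tokenizer chokes.
    try:
        return pd.read_csv(path, **kwargs)
    except CParserError:
        return pd.read_csv(path, engine='python', **kwargs)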

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file, one more variant. Solution 1 is again to read the CSV with the Python engine. An alternative that some have found useful for similar parsing errors is to use the stdlib csv module to re-route the data into a pandas DataFrame. For example:
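A minimal sketch of that csv-module detour (csv_to_df, the delimiter, and the padding policy are assumptions, not a canonical recipe):

import csv
import pandas as pd

def csv_to_df(path, delimiter=','):
    # The stdlib csv reader is more forgiving than the pandas C tokenizer.
    with open(path, 'rb') as f:  # 'rb' mode for the Python 2 csv module
        rows = list(csv.reader(f, delimiter=delimiter))
    header, data = rows[0], rows[1:]
    width = len(header)
    # Pad short rows / truncate long ones so every row matches the header.
    data = [r[:width] + [None] * (width - len(r)) for r in data]
    return pd.DataFrame(data, columns=header)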

Python Pandas: read_csv C-engine CParserError: Error tokenizing data, two more notes. If using compression='zip', the ZIP archive must contain only one data file to be read in. And a report against pandas 0.17.0: a 380+ MB CSV whose header has 54 fields but where some lines have only 53 raises the same ParserError.
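For ragged files like that one, combining the two suggestions that recur above, the Python engine plus skipping bad lines, is a reasonable first attempt (a sketch; whether silently dropped rows are acceptable depends on the data):

import pandas as pd

# Skip rows whose field count doesn't match instead of aborting the
# whole parse; warn_bad_lines reports which lines were dropped.
df = pd.read_csv('big.csv', engine='python',
                 error_bad_lines=False, warn_bad_lines=True)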

Comments
  • No way we can know without the data file or sample data.
  • Have you tried using the Python engine option for reading your data, as suggested by the error message?
  • Have you tried to add engine='python' as the error message suggests?
  • PS: pandas version?
  • @Boud pandas 0.17.1 np110py27_0 is what conda gives; I also tried engine='python' with no effect.