How to keep null values when writing to csv
I'm writing data from SQL Server into a csv file using Python's csv module and then uploading the csv file to a Postgres database using the COPY command. The issue is that Python's csv writer automatically converts None into an empty string "", and my job fails when the column is an int or float datatype and it tries to insert this "" where a None or null value is expected.
From the csv module documentation:

To make it as easy as possible to interface with modules which implement the DB API, the value None is written as the empty string.
What is the best way to keep the null value? Is there a better way to write csvs in Python? I'm open to all suggestions.
I have lat and long values:
42.313270000 -71.116240000
42.377010000 -71.064770000
NULL NULL
When writing to csv it converts nulls to "":
with file_path.open(mode='w', newline='') as outfile:
    csv_writer = csv.writer(outfile, delimiter=',', quoting=csv.QUOTE_NONNUMERIC)
    if include_headers:
        csv_writer.writerow(col[0] for col in self.cursor.description)
    for row in self.cursor:
        csv_writer.writerow(row)
42.313270000,-71.116240000
42.377010000,-71.064770000
"",""
The PostgreSQL COPY documentation describes the NULL option:

Specifies the string that represents a null value. The default is \N (backslash-N) in text format, and an unquoted empty string in CSV format. You might prefer an empty string even in text format for cases where you don't want to distinguish nulls from empty strings. This option is not allowed when using binary format.
What solved the problem for me was changing the quoting to csv.QUOTE_MINIMAL.
csv.QUOTE_MINIMAL: Instructs writer objects to only quote those fields which contain special characters such as delimiter, quotechar or any of the characters in lineterminator.
Related questions: - Postgresql COPY empty string as NULL not work
You have two options here: change the csv.writer() quoting option in Python, or tell PostgreSQL to accept quoted strings as possible NULLs (requires PostgreSQL 9.4 or newer).

csv.writer() and quoting

On the Python side, you are telling the csv.writer() object to add quotes, because you configured it to use csv.QUOTE_NONNUMERIC:

Instructs writer objects to quote all non-numeric fields.

None values are non-numeric, so result in "" being written. Switch to using csv.QUOTE_MINIMAL or csv.QUOTE_NONE:

csv.QUOTE_MINIMAL: Instructs writer objects to only quote those fields which contain special characters such as delimiter, quotechar or any of the characters in lineterminator.

csv.QUOTE_NONE: Instructs writer objects to never quote fields. When the current delimiter occurs in output data it is preceded by the current escapechar character.

Since all you are writing is longitude and latitude values, you don't need any quoting here; there are no delimiters or quote characters present in your data.

With either option, the CSV output for None values is simply an empty string:
>>> import csv
>>> from io import StringIO
>>> def test_csv_writing(rows, quoting):
...     outfile = StringIO()
...     csv_writer = csv.writer(outfile, delimiter=',', quoting=quoting)
...     csv_writer.writerows(rows)
...     return outfile.getvalue()
...
>>> rows = [
...     [42.313270000, -71.116240000],
...     [42.377010000, -71.064770000],
...     [None, None],
... ]
>>> print(test_csv_writing(rows, csv.QUOTE_NONNUMERIC))
42.31327,-71.11624
42.37701,-71.06477
"",""

>>> print(test_csv_writing(rows, csv.QUOTE_MINIMAL))
42.31327,-71.11624
42.37701,-71.06477
,

>>> print(test_csv_writing(rows, csv.QUOTE_NONE))
42.31327,-71.11624
42.37701,-71.06477
,
NULL values and COPY FROM

As of PostgreSQL 9.4, you can also force PostgreSQL to accept quoted empty strings as NULLs, when you use the FORCE_NULL option. From the COPY FROM documentation:

Match the specified columns' values against the null string, even if it has been quoted, and if a match is found set the value to NULL. In the default case where the null string is empty, this converts a quoted empty string into NULL. This option is allowed only in COPY FROM, and only when using CSV format.
Naming the columns in a FORCE_NULL option lets PostgreSQL accept both the empty column and NULL values for those columns, e.g.:
COPY position (
    lon,
    lat
)
FROM 'filename'
WITH (
    FORMAT csv,
    NULL '',
    DELIMITER ',',
    FORCE_NULL(lon, lat)
);
at which point it doesn't matter anymore what quoting options you used on the Python side.
Other options to consider
For simple data transformation tasks from other databases, don't use Python
If you are already querying databases to collate data to go into PostgreSQL, consider directly inserting into Postgres. If the data comes from other sources, the foreign data wrapper (fdw) module lets you cut out the middle-man and directly pull data into PostgreSQL from other sources.
Numpy data? Consider using COPY FROM as binary, directly from Python
Numpy data can be inserted more efficiently via binary COPY FROM; the linked answer augments a numpy structured array with the required extra metadata and byte ordering, then efficiently creates a binary copy of the data and inserts it into PostgreSQL using COPY FROM STDIN WITH BINARY and the psycopg2.copy_expert() method. This neatly avoids number -> text -> number conversions.
Persisting data to handle large datasets in a pipeline?
Don't re-invent the data pipeline wheel. Consider using existing projects such as Apache Spark, which have already solved the efficiency problems. Spark lets you treat data as a structured stream, includes the infrastructure to run data analysis steps in parallel, and lets you treat distributed, structured data as Pandas dataframes.
Another option might be to look at Dask to help share datasets between distributed tasks to process large amounts of data.
Even if converting an already running project to Spark might be a step too far, at least consider using Apache Arrow, the data exchange platform Spark builds on top of. The pyarrow project would let you exchange data via Parquet files, or exchange data over IPC.
The Pandas and Numpy teams are quite heavily invested in supporting the needs of Arrow and Dask (there is considerable overlap in core members between these projects) and are actively working to make Python data exchange as efficient as possible, including extending Python's pickle module to allow for out-of-band data streams to avoid unnecessary memory copying when sharing data.
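That out-of-band pickle support has since landed as protocol 5 (PEP 574, Python 3.8+). A small sketch of the idea, where a large buffer travels outside the pickle stream instead of being copied into it:

```python
import pickle

# A large writable buffer we'd rather not copy into the pickle stream.
big = bytearray(b'x' * 1_000_000)

buffers = []
# With protocol 5, buffer_callback collects PickleBuffer views; since
# list.append returns None (falsy), each buffer stays out-of-band.
data = pickle.dumps(pickle.PickleBuffer(big), protocol=5,
                    buffer_callback=buffers.append)

# The pickle byte stream stays tiny; the payload is shipped separately
# and handed back at load time, avoiding an extra copy.
restored = pickle.loads(data, buffers=buffers)
```

This is the mechanism Dask and friends can use to share large arrays between processes without serializing the raw bytes into the pickle stream itself.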
for row in self.cursor:
    csv_writer.writerow(row)

uses the writer as-is, but you don't have to do that. You can filter the values to change particular values with a generator comprehension and a ternary expression:

for row in self.cursor:
    csv_writer.writerow("null" if x is None else x for x in row)
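A quick sketch of the effect, using an in-memory buffer and QUOTE_MINIMAL so the placeholder stays unquoted (the rows are illustrative; COPY would then need a matching NULL 'null' option to read the placeholder back as NULL):

```python
import csv
from io import StringIO

rows = [(42.31327, -71.11624), (None, None)]

outfile = StringIO()
csv_writer = csv.writer(outfile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
for row in rows:
    # Swap None for an explicit placeholder before the writer sees it
    csv_writer.writerow("null" if x is None else x for x in row)

print(outfile.getvalue())
# 42.31327,-71.11624
# null,null
```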
You are asking for csv.QUOTE_NONNUMERIC. This will turn everything that is not a number into a string. You should consider using csv.QUOTE_MINIMAL, as it might be more what you are after:
import csv

test_data = (None, 0, '', 'data')

for name, quotes in (('test1.csv', csv.QUOTE_NONNUMERIC),
                     ('test2.csv', csv.QUOTE_MINIMAL)):
    with open(name, mode='w', newline='') as outfile:
        csv_writer = csv.writer(outfile, delimiter=',', quoting=quotes)
        csv_writer.writerow(test_data)
I'm writing data from sql server into a csv file using Python's csv module and then uploading the csv file to a postgres database using the copy command.
I believe your true requirement is that you need to hop data rows through the filesystem, and as both the sentence above and the question title make clear, you are currently doing that with a csv file. The trouble is that the csv format offers poor support for the RDBMS notion of NULL. Let me solve your problem for you by changing the question slightly. I'd like to introduce you to parquet format. Given a set of table rows in memory, it allows you to very quickly persist them to a compressed binary file, and recover them, with metadata and NULLs intact, with no text quoting hassles. Here is an example, using the pyarrow 0.12.1 parquet engine:
import pandas as pd
import pyarrow

def round_trip(fspec='/tmp/locations.parquet'):
    rows = [
        dict(lat=42.313, lng=-71.116),
        dict(lat=42.377, lng=-71.065),
        dict(lat=None, lng=None),
    ]
    df = pd.DataFrame(rows)
    df.to_parquet(fspec)
    del df
    df2 = pd.read_parquet(fspec)
    print(df2)

if __name__ == '__main__':
    round_trip()
      lat     lng
0  42.313 -71.116
1  42.377 -71.065
2     NaN     NaN
Once you've recovered the rows in a dataframe, you're free to call df2.to_sql() or use some other favorite technique to put numbers and NULLs into a DB table.
If you're able to run .to_sql() on the PG server, or on the same LAN, then do that. Otherwise your favorite technique will likely involve executemany(). The summary is that with psycopg2, "bulk INSERT is slow". Middle layers like sqlalchemy and pandas, and well-written apps that care about insert performance, will use executemany(). The idea is to send lots of rows all at once, without waiting for individual result status, because we're not worried about unique index violations. So TCP gets a giant buffer of SQL text and sends it all at once, saturating the end-to-end channel's bandwidth, much as copy_expert sends a big buffer to TCP to achieve high bandwidth.
In contrast, the psycopg2 driver lacks support for high-performance executemany. As of 2.7.4 it just executes items one at a time, sending a SQL command across the WAN and waiting a round-trip time for the result before sending the next command. Ping your server; if ping times suggest you could get a dozen round trips per second, then plan on only inserting about a dozen rows per second. Most of the time is spent waiting for a reply packet, rather than processing DB rows. It would be lovely if at some future date psycopg2 would offer better support for this.
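To picture why batching matters, here is a hypothetical helper (names invented for illustration) that folds many rows into a single multi-row INSERT; this is the same idea psycopg2.extras.execute_values implements properly, with real parameter escaping:

```python
def multirow_insert(table, columns, rows):
    """Build one multi-row INSERT statement (hypothetical sketch only).

    Real code must parameterize values (e.g. psycopg2.extras.execute_values);
    this just illustrates why one statement beats N round trips.
    """
    def literal(value):
        # NULL survives as the SQL keyword, not as an empty string
        return 'NULL' if value is None else repr(value)

    values = ','.join(
        '(%s)' % ','.join(literal(v) for v in row) for row in rows
    )
    return 'INSERT INTO %s (%s) VALUES %s' % (table, ','.join(columns), values)

sql = multirow_insert('position', ['lon', 'lat'],
                      [(42.31327, -71.11624), (None, None)])
print(sql)
# INSERT INTO position (lon,lat) VALUES (42.31327,-71.11624),(NULL,NULL)
```

One such statement costs a single round trip regardless of row count, which is why it (and COPY) so thoroughly outperform row-at-a-time executemany over a WAN.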
I would use pandas, psycopg2, and sqlalchemy. Make sure they are installed. Coming from your current workflow, and avoiding writing to csv:
# no need to import psycopg2
import pandas as pd
from sqlalchemy import create_engine

# create connection to postgres
engine = create_engine('postgres://.....')

# get column names from cursor.description
columns = [col[0] for col in self.cursor.description]

# convert data into dataframe
df = pd.DataFrame(self.cursor.fetchall(), columns=columns)

# send dataframe to postgres
df.to_sql('name_of_table', engine, if_exists='append', index=False)

# if you still need to write to csv
df.to_csv('your_file.csv')
- Can you share an example? The csv writer can write integers (as strings) and floats (as strings). What do you want to write in place of None: "Null"?
- I ended up using QUOTE_MINIMAL and it worked for most datasets, but it created extra columns for others. I'm still looking into why that happens, but Jean-François's answer, along with specifying the null value in the copy command, works, albeit very slowly. This is for an intermediate step to store larger datasets that can't fit in memory in a data pipeline, and I wonder if you have any suggestions outside of this question. I appreciate your answer regardless.
- @JonathanPorter: it sounds as if you are trying to build a data pipeline from scratch. Have you considered using Apache Spark instead? Spark has excellent, first class Python support, and lets you stream datasets, operate on datasets in parallel or lets you access data via Apache Arrow, a much more efficient format to exchange large amounts of columnar data.
- @JonathanPorter: Jean-François's solution is slow because it transforms each row by hand, and you can't really speed up that operation when you have a structured data source as input (numpy, database query rows, etc.), because such sources balk at mixing numeric and string types. You'd be better off sticking with FORCE_NULL on the numeric columns instead. Not that I can quite envision what you mean by 'created extra columns'; that sounds like a new problem elsewhere, probably a bug in the way you write the CSV file.
- @JonathanPorter: I can think of a scenario where you might see extra columns: when your data contains delimiters or quotes in a value and you didn't configure the QUOTE options to match the settings used in your writer.
- Thank you for all of these suggestions. I indeed did not configure those options but I will double check them and also look into spark! I'll be sure to post what worked best in my case.
- Although this doesn't solve being unable to insert varchar values into an int or float column, I think it can help.
- Sorry, I cannot help more without more input. I don't know why int or float doesn't work as it should with csv, and I don't know what varchar is.
- Note that DataFrame.to_sql() is really very slow indeed. Bulk inserts into PostgreSQL are still best done with COPY FROM.
- Why complicate matters with additional software? DataFrame.to_csv() still uses the csv.writer() object to do the actual writing to a file.
- From what I understand, the only reason Nulls are a problem is that the OP can't insert them into Postgres. Does the OP really need to save them in CSV, or is CSV just an intermediary step? Regardless, pandas is a good tool to use, especially if the OP needs to transform the data in the future. Complicated is not always a bad thing; simple isn't always the best solution.
- Note that to_sql() is quite slow; CSV import is a lot faster for bulk import. Neither answers the question the OP has, and the answer to that question is quite simple: switch the quoting configuration.
- I concede that it doesn't answer the OP's question directly, but I stand by it as an alternative: less performant but more flexible. Also, I'm having jitters and am elated that you commented on my answer.