How to save a DataFrame as compressed (gzipped) CSV?

I use Spark 1.6.0 and Scala.

I want to save a DataFrame as compressed CSV format.

Here is what I have so far (assume I already have df and sc as SparkContext):

//set the conf to the codec I want
sc.getConf.set("spark.hadoop.mapred.output.compress", "true")
sc.getConf.set("spark.hadoop.mapred.output.compression.codec", "true")
sc.getConf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
sc.getConf.set("spark.hadoop.mapred.output.compression.type", "BLOCK")

df.write
  .format("com.databricks.spark.csv")
  .save(my_directory)

The output is not in gz format.

On the spark-csv GitHub page (https://github.com/databricks/spark-csv) one can read:

codec: compression codec to use when saving to file. Should be the fully qualified name of a class implementing org.apache.hadoop.io.compress.CompressionCodec or one of case-insensitive shorten names (bzip2, gzip, lz4, and snappy). Defaults to no compression when a codec is not specified.

In your case, this should work:

df.write
  .format("com.databricks.spark.csv")
  .codec("gzip")
  .save("my_directory/my_file.gzip")
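
Note that .codec(...) is not part of the standard DataFrameWriter API, so depending on your Spark/spark-csv build it may not exist at all. If it doesn't, the codec setting from the documentation above can be passed through .option(...) instead. A minimal sketch, assuming the spark-csv package is on the classpath and df is your DataFrame:

df.write
  .format("com.databricks.spark.csv")
  .option("codec", "gzip")  // short name from the spark-csv docs; a fully qualified codec class also works
  .save("my_directory")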

This code works for Spark 2.1, where .codec is not available.

df.write
  .format("com.databricks.spark.csv")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(my_directory)

For Spark 2.2, you can use the df.write.csv(...,codec="gzip") option described here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=codec
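
For the Scala API on Spark 2.x, the built-in CSV source takes a "compression" option directly; a minimal sketch, assuming a DataFrame df and an output directory of your choosing:

df.write
  .option("header", "true")        // optional: write the column names into each part file
  .option("compression", "gzip")   // short names such as none, bzip2, gzip, lz4, snappy, deflate
  .csv("my_directory")             // produces gzipped part-*.csv.gz files inside my_directory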

To write the CSV file with a header and rename the part-000 output file to .csv.gzip:

DF.coalesce(1).write
  .format("com.databricks.spark.csv")
  .mode("overwrite")
  .option("header", "true")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(tempLocationFileName)

copyRename(tempLocationFileName, finalLocationFileName)

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

def copyRename(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  // merge all part files under srcPath into the single file at dstPath;
  // the "true" flag deletes the source files once they are merged into the new output
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
}

If you don't need the header, set it to false; then you don't need the coalesce(1) either, because copyMerge can safely concatenate several part files when none of them carries a header row. The write will be faster as well.
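
A rough sketch of that header-less variant, using the same spark-csv package (outputDirectory is a hypothetical path variable):

DF.write
  .format("com.databricks.spark.csv")
  .mode("overwrite")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(outputDirectory)  // one gzipped part file per partition, written in parallel

Skipping coalesce(1) keeps the write parallel; the trade-off is that you end up with several part files instead of a single one, unless you merge them afterwards.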

Comments
  • Related question about RDDs: stackoverflow.com/questions/32231650/…
  • While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value.
  • When using the "json" format, the compression does not get picked up
  • It looks like the keyword argument has been changed to compression. spark.apache.org/docs/latest/api/python/…
  • Thanks for linking to csv writer docs, and not giving a databricks only answer!
  • @LaurensKoppenol - Well, to be fair, the CSV support added to Spark proper originally started as the external Databricks CSV package linked to in the accepted answer. :) That package is available to any Spark user to use, but starting with Spark 2.0 you don't need it anymore.
  • Thanks. This was very helpful.
  • I had to use df.write.option("compression","gzip").csv("path") with Spark 2.2.