How to make spark write a _SUCCESS file for empty parquet output?

One of my Spark jobs is currently running over empty input and so produces no output. That's fine for now, but I still need to know that the job ran, even if it produced no Parquet output.

Is there a way of forcing Spark to write a _SUCCESS file even if there was no output at all? Currently it writes nothing to the directory where the output would go, so I have no way of telling whether the job failed (this is part of a larger automated pipeline, which keeps rescheduling the job because there's no indication it already ran).

The _SUCCESS file is written by Hadoop code (the output committer), not by Spark itself. So if your Spark app doesn't generate any output, you can use the Hadoop API to create the _SUCCESS file yourself.

If you are using PySpark, look into https://github.com/spotify/snakebite (a Python HDFS client).

If you are using Scala or Java, look into the Hadoop FileSystem API.
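
For example, a minimal Scala sketch along those lines (the /path/on/hdfs directory is just a placeholder for wherever your job would normally write its output):

import org.apache.hadoop.fs.{FileSystem, Path}

// Create the output directory (if needed) and an empty _SUCCESS marker,
// mimicking what Hadoop's output committer would have written.
val outputDir = new Path("/path/on/hdfs")
val fs = FileSystem.get(sc.hadoopConfiguration)

if (!fs.exists(outputDir)) {
  fs.mkdirs(outputDir)
}
fs.createNewFile(new Path(outputDir, "_SUCCESS"))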

An alternative would be to ask Spark to write an empty dataset to the output location. But this might not be what you need, because there will be a part-00000 file alongside the _SUCCESS file, which downstream consumers might not like.

Here is how to save an empty dataset in PySpark (the Scala code is essentially the same):

$ pyspark
>>> sc.parallelize([], 1).saveAsTextFile("/path/on/hdfs")
>>> exit()

$ hadoop fs -ls /path/on/hdfs
Found 2 items
-rw-r--r--   2 user user          0 2016-02-25 12:54 /path/on/hdfs/_SUCCESS
-rw-r--r--   2 user user          0 2016-02-25 12:54 /path/on/hdfs/part-00000

See also SPARK-23271 ("Parquet output contains only _SUCCESS file after writing an empty dataframe"): when a zero-partition DataFrame is written, newer Spark versions create a schema-only output file alongside _SUCCESS.

With Spark 1.6:

If you write a DataFrame with a forced schema using the Avro writer, zero rows still produce at least one part-r-{part number}.avro file (containing essentially a schema without rows) plus a _SUCCESS file. A pseudocode example:

resultData.persist(StorageLevel.MEMORY_AND_DISK) // or whichever storage level fits

if (resultData.count == 0) {
  // Zero rows: write the (empty) dataset anyway so the schema-only
  // part file and the _SUCCESS marker still get created.
  resultData
    .coalesce(1)
    .write
    .avro(memberRelationshipMapOutputDir)
} else {
  doSomething()
}

resultData.unpersist()

It's possible to tweak this from Avro to Parquet and to work out how the row count relates to the coalesce factor (or to switch to approximate counts). The example above also brings up that the schema may need to be forced onto the internal data before writing, so something like this may be required:

case class Member(club : String, username : String)

hiveContext
    .read
    .schema(ScalaReflection.schemaFor[Member].dataType.asInstanceOf[StructType])
    .avro(memberRelationshipMapInputDir)

Some useful imports / code may be:

import com.databricks.spark.avro._
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.hive.HiveContext


val hiveContext = new HiveContext(sparkContext)
import hiveContext.implicits._

Disclaimer: some of this may have changed for Spark 2.x, and all of the above is 'Scala-like pseudocode'.

In order to convert an RDD of MyRow to a DataFrame, you can either use the read above to get the data, or convert the RDD to an appropriate DataFrame with createDataFrame or toDF.
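
For instance, a minimal sketch of that conversion, reusing the Member case class above (memberRdd is a placeholder for whatever RDD[Member] you already have):

// Reflection-based conversion for an RDD of case-class instances
val memberDf = hiveContext.createDataFrame(memberRdd)

// Or, using the implicits imported from hiveContext above
val memberDf2 = memberRdd.toDF()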

You can use emptyRDD to write just the _SUCCESS flag (an empty RDD has no partitions, so no part files are produced):

spark.sparkContext.emptyRDD[MyRow].saveAsTextFile(outputPath)
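
If the downstream step expects Parquet rather than a text file, a hedged variant of the same idea (assuming Spark 2.x, a placeholder case class MyRow, and a placeholder outputPath) is to write an empty, typed Dataset; depending on the Spark version this produces a _SUCCESS file and possibly a schema-only part file (see the SPARK-23271 note above):

import org.apache.spark.sql.SparkSession

case class MyRow(id: Long, name: String) // placeholder schema

val spark = SparkSession.builder().appName("empty-output").getOrCreate()
import spark.implicits._

val outputPath = "/path/on/hdfs/output" // placeholder

// An empty, typed Dataset still carries the MyRow schema with zero rows.
spark.emptyDataset[MyRow]
  .write
  .mode("overwrite")
  .parquet(outputPath)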

Comments
  • How can I write an empty dataset with a schema of MyRow (where that is the type I use when there is data)? That might be what I need.
  • See the example in the answer
  • What is the type of your non-empty dataset, and which format do you use to save it? In the example the dataset is an RDD with 0 records. Saving it to a text file means serializing it with Python's str(record), which results in a file of size 0.
  • I need to save an empty RDD[MyRow] as parquet.