How to make spark write a _SUCCESS file for empty parquet output?

One of my spark jobs is currently running over empty input and so produces no output. That's fine for now, but I still need to know that the spark job ran even if it produced no parquet output.

Is there a way of forcing spark to write a _SUCCESS file even if there was no output at all? Currently it doesn't write anything to the directory where there would be output if there was input so I've no way of determining if there was a failure (this is part of a larger automated pipeline and so it keeps rescheduling the job because there's no indication it already ran).

The _SUCCESS file is written by Hadoop code. So if your Spark app doesn't generate any output, you can use the Hadoop API to create the _SUCCESS file yourself.

If you are using PySpark - look into

If you are using Scala or Java - look into Hadoop API.
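
As a rough sketch of creating the marker yourself, the snippet below shells out to the `hadoop fs` command line rather than calling the Hadoop FileSystem API directly, which is a substitution made here for brevity. It assumes the `hadoop` CLI is on the PATH; the function names `success_marker_commands` and `write_success_marker` are made up for illustration.

```python
import subprocess


def success_marker_commands(output_dir):
    """Build the `hadoop fs` commands that create output_dir/_SUCCESS."""
    marker = output_dir.rstrip("/") + "/_SUCCESS"
    return [
        # Make sure the output directory exists (no error if it already does).
        ["hadoop", "fs", "-mkdir", "-p", output_dir],
        # Create a zero-length marker file, like the one Hadoop writes.
        ["hadoop", "fs", "-touchz", marker],
    ]


def write_success_marker(output_dir):
    """Run the commands; raises CalledProcessError if either one fails."""
    for cmd in success_marker_commands(output_dir):
        subprocess.run(cmd, check=True)
```

Calling `write_success_marker("/path/on/hdfs")` at the end of a job that produced no rows would leave an empty `/path/on/hdfs/_SUCCESS` behind, without any part files.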

An alternative would be to ask Spark to write an empty dataset to the output. But this might not be what you need, because there will be a part-00000 file and a _SUCCESS file, which downstream consumers might not like.

Here is how to save an empty dataset in PySpark (in Scala the code should be the same):

$ pyspark
>>> sc.parallelize([], 1).saveAsTextFile("/path/on/hdfs")
>>> exit()

$ hadoop fs -ls /path/on/hdfs
Found 2 items
-rw-r--r--   2 user user          0 2016-02-25 12:54 /path/on/hdfs/_SUCCESS
-rw-r--r--   2 user user          0 2016-02-25 12:54 /path/on/hdfs/part-00000
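
On the scheduling side, a pipeline could look for that marker before deciding whether to rerun the job. The following is only a sketch: it relies on `hadoop fs -test -e`, which exits with status 0 when the path exists, and the function name `has_success_marker` and its injectable `run` parameter are assumptions added to keep the check easy to try without a cluster.

```python
import subprocess


def has_success_marker(output_dir, run=subprocess.run):
    """Return True if output_dir/_SUCCESS exists according to `hadoop fs -test -e`."""
    marker = output_dir.rstrip("/") + "/_SUCCESS"
    # `hadoop fs -test -e PATH` exits with status 0 when PATH exists.
    result = run(["hadoop", "fs", "-test", "-e", marker])
    return result.returncode == 0
```

A scheduler would then skip the job when `has_success_marker("/path/on/hdfs")` is true, whether or not any part files were written.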

With Spark 1.6:

When writing a DataFrame with a forced schema using the Avro writer, zero rows produce at least one part-r-{part number}.avro file (containing essentially a schema without rows) and a _SUCCESS file. With this pseudocode example:

resultData.persist(/* optional storage value */)

if(resultData.count == 0) 


It's possible to swap Avro for Parquet and to work out how the row count should relate to the coalesce factor (and to switch to approximate counts). The example above also shows that a schema may need to be forced on the internal data before writing, so this may be required:

case class Member(club : String, username : String)


Some useful imports / code may be:

import com.databricks.spark.avro._
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sparkContext)
import hiveContext.implicits._

Disclaimer: Some of this may have changed in Spark 2.x, and all of the above is 'Scala-like pseudocode'.

In order to convert an RDD of MyRow to a DataFrame, it's possible to use the read above to get the data, or to convert the RDD to an appropriate DataFrame with createDataFrame or toDF.

You can use emptyRDD to write just the _SUCCESS flag: spark.sparkContext.emptyRDD[MyRow].saveAsTextFile(outputPath)

  • How can I write an empty dataset with a schema of MyRow (where that is the type I use when there is data)? That might be what I need.
  • See the example in the answer
  • What is the type of your non-empty dataset, and which format do you use to save it? In the example, the dataset is an RDD with 0 records. Saving it to a text file means serializing each record with Python's str(record), which results in a file of size 0.
  • I need to save an empty RDD[MyRow] as parquet.