Read all Parquet files saved in a folder via Spark
I have a folder containing Parquet files. Something like this:
scala> val df = sc.parallelize(List(1,2,3,4)).toDF()
df: org.apache.spark.sql.DataFrame = [value: int]

scala> df.write.parquet("/tmp/test/df/1.parquet")

scala> val df = sc.parallelize(List(5,6,7,8)).toDF()
df: org.apache.spark.sql.DataFrame = [value: int]

scala> df.write.parquet("/tmp/test/df/2.parquet")
After saving the DataFrames, when I try to read all the Parquet files in the df folder, I get an error:
scala> val read = spark.read.parquet("/tmp/test/df")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:188)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
  ... 48 elided
I know I can read Parquet files by giving the full path, but it would be better if there were a way to read all the Parquet files in a folder.
Spark doesn't write/read Parquet the way you think it does. It uses the Hadoop library to write and read partitioned Parquet output. Thus your first "Parquet file", /tmp/test/df/1.parquet, is actually a directory. This means that when reading Parquet, you need to provide the path to your Parquet directory (or the path to a single part file inside it):
val df = spark.read.parquet("/tmp/test/df/1.parquet/")
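To make the directory point concrete, here is a small plain-Scala sketch (no Spark needed) that hand-builds the kind of layout df.write.parquet typically leaves behind; the part-file name is an illustrative stand-in for Spark's generated names, not real Spark output:

```scala
import java.nio.file.Files

// Recreate, by hand, the shape that df.write.parquet("/tmp/test/df/1.parquet")
// produces: "1.parquet" is a DIRECTORY, and the data lives in part files inside it.
val base = Files.createTempDirectory("df")
val dir  = Files.createDirectories(base.resolve("1.parquet"))
Files.createFile(dir.resolve("_SUCCESS"))                   // job-completion marker
Files.createFile(dir.resolve("part-00000.snappy.parquet"))  // illustrative part-file name

val names = dir.toFile.list().sorted.toList
println(names) // List(_SUCCESS, part-00000.snappy.parquet)
```

This is why pointing spark.read.parquet at the directory (not at a nonexistent single file) is the right call.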
I advise you to read the official documentation for more details. [cf. SQL Programming Guide - Parquet Files]
You must be looking for something like this:
scala> sqlContext.range(1,100).write.save("/tmp/test/df/1.parquet")

scala> sqlContext.range(100,500).write.save("/tmp/test/df/2.parquet")

scala> val df = sqlContext.read.load("/tmp/test/df/*")
// df: org.apache.spark.sql.DataFrame = [id: bigint]

scala> df.show(3)
// +---+
// | id|
// +---+
// |400|
// |401|
// |402|
// +---+
// only showing top 3 rows

scala> df.count
// res3: Long = 499
You can also use wildcards in your file path URIs, and you can provide multiple file paths as follows:
scala> val df2 = sqlContext.read.load("/tmp/test/df/1.parquet","/tmp/test/df/2.parquet")
// df2: org.apache.spark.sql.DataFrame = [id: bigint]

scala> df2.count
// res5: Long = 499
The path you wrote to, /tmp/test/df/2.parquet, is not an output file; it is an output directory. So you can read the Parquet data like this:
val data = spark.read.parquet("/tmp/test/df/1.parquet/")
Rather than saving each DataFrame into its own "file" (which is in fact a folder), you can write all the data into a single folder. If you set only a path and no file name, Spark will put the part files into that folder as real files (not folders), and name those files automatically:
df1.write.partitionBy("countryCode").format("parquet").mode("overwrite").save("/tmp/data1/")
df2.write.partitionBy("countryCode").format("parquet").mode("append").save("/tmp/data1/")
df3.write.partitionBy("countryCode").format("parquet").mode("append").save("/tmp/data1/")
Then we can read the data from all files in the data folder:
val df = spark.read.format("parquet").load("/tmp/data1/")
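For reference, partitionBy("countryCode") lays the data out in countryCode=&lt;value&gt; subdirectories under the target folder. Here is a hand-built mock of that tree in plain Scala (no Spark involved; the country codes and part-file name are illustrative):

```scala
import java.nio.file.Files

// Mimic the tree that .partitionBy("countryCode").save("/tmp/data1/") produces.
// Everything here is created by hand, purely to illustrate the layout.
val root = Files.createTempDirectory("data1")
for (cc <- Seq("US", "DE")) {
  val partDir = Files.createDirectories(root.resolve(s"countryCode=$cc"))
  Files.createFile(partDir.resolve("part-00000.snappy.parquet"))
}

val tree = root.toFile.list().sorted.toList
println(tree) // List(countryCode=DE, countryCode=US)
```

Because the partition value is encoded in the directory name, a later spark.read on the folder recovers countryCode as a column and can skip whole subdirectories when you filter on it.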
- OK, that wasn't clear in your question. I have updated my answer.
- Thanks @eliasah. val df = sqlContext.read.load("/tmp/test/df/*") worked for me.
- I am writing Java code for a Parquet file in Spark version 2.1.1. Code:

Dataset<Row> df = spark.read().json("adl://xxxxxxxxx.azuredatalakestore.net/test/data.json");
df.write().mode("overwrite").parquet("data.parquet");
Dataset<Row> newDataDF = spark.read().parquet("data.parquet/");

It is showing the same error, "org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;" Can you please assist me?
- @umarfaraz you should post a question like everyone else. If I have some time to look at it, I will. I'm currently busy.
- I have the same error, so I thought of posting it here. And sure, take your time @eliasah.