Spark: Reading avro file without com.databricks.spark.avro
I wanted to read an Avro file in Spark, but unfortunately the cluster in my company does not have com.databricks.spark.avro. So I tried:
spark-shell --package com.databricks:spark-avro_2.10:0.1.
This gives an unresolved dependency error, and import com.databricks.spark.avro._ is not supported.
spark-shell --jar spark-avro_2.11-3.2.0.jar
This does not open the shell.
spark.read.format("com.databricks.spark.avro").load("dirpath/*.avro") returns org.apache.spark.sql.AnalysisException: Failed to find data source: com.databricks.spark.avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;
spark.read.avro("dirpath/*.avro") returns error: value avro is not a member of org.apache.spark.sql.DataFrameReader
The table is also very big (an Avro table partitioned on date/field1/field2), and running the query with spark.sql("") hits a GC overhead error.
Any help please.
First of all, it's not --package, it's --packages. Secondly, the version seems to be incomplete.

spark-shell --packages com.databricks:spark-avro_2.10:2.0.1

Then import the Avro essentials (import com.databricks.spark.avro._).
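With the package resolved, a minimal session might look like the sketch below ("dirpath" is the asker's placeholder path; on Spark 1.x builds use sqlContext.read instead of spark.read):

```scala
// Spark shell sketch, assuming spark-avro was resolved via --packages above.
import com.databricks.spark.avro._

// Either the explicit format string...
val df = spark.read.format("com.databricks.spark.avro").load("dirpath/*.avro")
// ...or the .avro shorthand added by the import above.
val df2 = spark.read.avro("dirpath/*.avro")
df2.printSchema()
```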
Apache Spark 2.4 ships a built-in Avro package; the Databricks blog on it shows how to read and write Avro data within a DataFrame, not just as files. You can load/write the same tables using this built-in Avro module without any code changes. Similar to from_json and to_json, you can use from_avro and to_avro with any binary column, but you must specify the Avro schema manually.

import org.apache.spark.sql.avro.functions._
import org.apache.avro.SchemaBuilder
// When reading the key and value of a Kafka topic, decode the
// binary (Avro) data into structured data.
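Continuing that Kafka idea, a hedged sketch follows (broker, topic, and the "user" record schema are placeholder assumptions; in Spark 2.4 the functions live under org.apache.spark.sql.avro._, in 3.x under org.apache.spark.sql.avro.functions._):

```scala
import org.apache.spark.sql.avro.functions._
import org.apache.spark.sql.functions.col
import org.apache.avro.SchemaBuilder

// Writer schema for the Kafka value column (hypothetical record).
val valueSchema = SchemaBuilder.record("user").fields()
  .requiredString("name")
  .requiredInt("age")
  .endRecord()
  .toString

val users = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092") // placeholder broker
  .option("subscribe", "topic1")                   // placeholder topic
  .load()
  // Decode the binary (Avro) value into a structured column.
  .select(from_avro(col("value"), valueSchema).as("user"))
```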
This issue occurs because of the way you specify the Avro jars on the cluster. If you place the Databricks jars on the Spark classpath, they are available to the driver and executors; but if you use some kind of launcher, such as SparkLauncher or Apache Livy, you have to add them explicitly to the Spark session. I resolved it by adding extra properties:

sparkLauncher.setConf("spark.driver.extraClassPath", "com.databricks-spark-avro_2.11-4.0.0.jar")
sparkLauncher.setConf("spark.executor.extraClassPath", "com.databricks-spark-avro_2.11-4.0.0.jar")

This is a safe way to prevent Spark runtime SQL exceptions.
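The same idea works from the command line; a sketch (the jar filename mirrors the answer above and is an assumption about your build):

```shell
# Ship the spark-avro jar to both driver and executors explicitly.
spark-shell \
  --jars spark-avro_2.11-4.0.0.jar \
  --conf spark.driver.extraClassPath=spark-avro_2.11-4.0.0.jar \
  --conf spark.executor.extraClassPath=spark-avro_2.11-4.0.0.jar
```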
Apache Avro became a built-in data source in Apache Spark 2.4. To read/write data source tables that were previously created using com.databricks.spark.avro, you can load/write those same tables using the built-in Avro module, without any code changes. In fact, if you prefer to use your own build of a spark-avro jar file, you can simply disable the corresponding spark.sql.legacy configuration.
In a Spark cluster you need the spark-avro jar file. You can download it from https://spark-packages.org/package/databricks/spark-avro. After downloading, copy the file onto the cluster where Spark can pick it up.
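A sketch of those steps (the Maven Central URL and version here are assumptions; use whichever spark-avro build matches your Scala and Spark versions):

```shell
# Download a spark-avro build and hand it to the shell.
wget https://repo1.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar
spark-shell --jars spark-avro_2.11-4.0.0.jar
```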
The spark-avro library allows you to write and read partitioned data without extra configuration; note that enumerated types are read into Spark as strings, because Spark does not support enumerated types.

val df = sqlContext.read.format("com.databricks.spark.avro").load("input_dir")

Apache Avro is a data serialization format; we can store data as .avro files on disk. Avro files are typically used with Spark, but Spark is completely independent of Avro. Avro is a row-based format that is suitable for evolving data schemas. One benefit of using Avro is that the schema and metadata travel with the data: if you have an .avro file, you have the schema of the data as well.
To include files without the .avro extension, set the avro.mapred.ignore.inputs.without.extension property to false. The spark-avro library supports writing and reading partitioned data, and an Avro directory can also be registered as a table:

CREATE TEMPORARY TABLE table_name USING com.databricks.spark.avro OPTIONS (path "input_dir")

Databricks has donated this library to the Apache Spark project as of Spark 2.4.0. Databricks customers can also use the library directly on the Databricks Unified Analytics Platform without any additional dependency configuration.
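In code, that extension property can be set on the Hadoop configuration before the read; a sketch ("input_dir" is a placeholder):

```scala
// Read Avro files that lack the .avro extension by disabling the filter.
spark.sparkContext.hadoopConfiguration
  .set("avro.mapred.ignore.inputs.without.extension", "false")
val df = spark.read.format("com.databricks.spark.avro").load("input_dir")
```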