save Spark dataframe to Hive: table not readable because "parquet not a SequenceFile"


I'd like to save data in a Spark (v 1.3.0) dataframe to a Hive table using PySpark.

The documentation states:

"spark.sql.hive.convertMetastoreParquet: When set to false, Spark SQL will use the Hive SerDe for parquet tables instead of the built in support."

Looking at the Spark tutorial, it seems that this property can be set:

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")

# code to create dataframe

my_dataframe.saveAsTable("my_dataframe")

However, when I try to query the saved table in Hive it returns:

hive> select * from my_dataframe;
OK
Failed with exception java.io.IOException:java.io.IOException: 
hdfs://hadoop01.woolford.io:8020/user/hive/warehouse/my_dataframe/part-r-00001.parquet
not a SequenceFile

How do I save the table so that it's immediately readable in Hive?

I've been there... The API is kind of misleading on this one. DataFrame.saveAsTable does not create a Hive table, but an internal Spark data source table. It also stores something in the Hive metastore, but not what you intend. (This remark was made on the spark-user mailing list regarding Spark 1.3.)

If you wish to create a Hive table from Spark, you can use this approach:

1. Use CREATE TABLE ... via Spark SQL for the Hive metastore.
2. Use DataFrame.insertInto(tableName, overwrite) for the actual data (Spark 1.3).
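For illustration, here is a minimal PySpark sketch of that two-step approach against the Spark 1.3 API. The table name and the user_id/email/ts columns are placeholders, not something taken from your actual schema, and the tiny DataFrame is just a stand-in for the "# code to create dataframe" in the question:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
sqlContext = HiveContext(sc)

# Stand-in for the DataFrame built in the question.
my_dataframe = sqlContext.createDataFrame(
    [("u1", "u1@example.com", "2015-01-01")],
    ["user_id", "email", "ts"])

# 1. Create a real Hive table up front so the metastore holds a proper
#    Parquet SerDe definition (column names are illustrative).
sqlContext.sql("""
    CREATE TABLE IF NOT EXISTS my_dataframe (
        user_id STRING,
        email   STRING,
        ts      STRING
    )
    STORED AS PARQUET
""")

# 2. Insert the DataFrame's rows into that table; column order must match.
#    In Spark 1.3 this is DataFrame.insertInto(tableName, overwrite).
my_dataframe.insertInto("my_dataframe", overwrite=True)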


I hit this issue last week and was able to find a workaround.

Here's the story: I can see the table in Hive if I create the table without partitionBy:

spark-shell>someDF.write.mode(SaveMode.Overwrite)
                  .format("parquet")
                  .saveAsTable("TBL_HIVE_IS_HAPPY")

hive> desc TBL_HIVE_IS_HAPPY;
      OK
      user_id                   string                                      
      email                     string                                      
      ts                        string                                      

But Hive can't understand the table schema (the schema is empty...) if I write it with partitionBy:

spark-shell>someDF.write.mode(SaveMode.Overwrite)
                  .partitionBy("ts")
                  .format("parquet")
                  .saveAsTable("TBL_HIVE_IS_NOT_HAPPY")

hive> desc TBL_HIVE_IS_NOT_HAPPY;
      # col_name                data_type               from_deserializer  

[Solution]:

spark-shell>sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
spark-shell>df.write
              .partitionBy("ts")
              .mode(SaveMode.Overwrite)
              .saveAsTable("Happy_HIVE")  // Suppose this table is saved at /apps/hive/warehouse/Happy_HIVE


hive> DROP TABLE IF EXISTS Happy_HIVE;
hive> CREATE EXTERNAL TABLE Happy_HIVE (user_id STRING, email STRING)
                                       PARTITIONED BY (ts STRING)
                                       STORED AS PARQUET
                                       LOCATION '/apps/hive/warehouse/Happy_HIVE';
hive> MSCK REPAIR TABLE Happy_HIVE;

The problem is that the datasource table created through the DataFrame API (partitionBy + saveAsTable) is not compatible with Hive (see this link). By setting spark.sql.hive.convertMetastoreParquet to false as suggested in the doc, Spark only puts the data onto HDFS but won't create the table in Hive. You can then manually go into the hive shell and create an external table with the proper schema and partition definition pointing to the data location. MSCK REPAIR TABLE adds any partition metadata that doesn't already exist, i.e. it registers partitions that exist on HDFS but not yet in the metastore. I've tested this in Spark 1.6.1 and it worked for me. I hope this helps!
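If you would rather not switch to the hive CLI, the same DDL can typically be issued from PySpark through a HiveContext. This is only a sketch against the Spark 1.6-era API; the location and column names are the ones used in this answer and should be adjusted to your data:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
sqlContext = HiveContext(sc)

# Point an external Hive table at the Parquet files that saveAsTable wrote.
sqlContext.sql("DROP TABLE IF EXISTS Happy_HIVE")
sqlContext.sql("""
    CREATE EXTERNAL TABLE Happy_HIVE (user_id STRING, email STRING)
    PARTITIONED BY (ts STRING)
    STORED AS PARQUET
    LOCATION '/apps/hive/warehouse/Happy_HIVE'
""")

# Register the ts=... partition directories that already exist on HDFS.
sqlContext.sql("MSCK REPAIR TABLE Happy_HIVE")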


I have done this in PySpark, Spark version 2.3.0:

First create an empty table where we need to save/overwrite the data, like:

create table databaseName.NewTableName like databaseName.OldTableName;

Then run the command below:

df1.write.mode("overwrite").partitionBy("year","month","day").format("parquet").saveAsTable("databaseName.NewTableName");

The issue is that you can't read this table with Hive, but you can read it with Spark.
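Put together as a single hedged sketch (Spark 2.3-style API; the database, table, and column names are placeholders, and databaseName.OldTableName is assumed to already exist):

from pyspark.sql import SparkSession

# Hive support is required so that saveAsTable talks to the Hive metastore.
spark = (SparkSession.builder
         .appName("save-df-to-hive")
         .enableHiveSupport()
         .getOrCreate())

# df1 stands in for whatever DataFrame you actually want to persist.
df1 = spark.createDataFrame(
    [("u1", "u1@example.com", 2020, 1, 1)],
    ["user_id", "email", "year", "month", "day"])

# 1. Clone the layout of the existing table (assumes it already exists).
spark.sql("CREATE TABLE IF NOT EXISTS databaseName.NewTableName "
          "LIKE databaseName.OldTableName")

# 2. Overwrite it with the DataFrame, partitioned by year/month/day.
(df1.write
    .mode("overwrite")
    .partitionBy("year", "month", "day")
    .format("parquet")
    .saveAsTable("databaseName.NewTableName"))

# Spark can read the result back even if the hive CLI cannot.
spark.table("databaseName.NewTableName").show()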



Comments
  • You have posted exactly the same answer here. If you think the question is a duplicate, you should mark it as such and not post the same answer twice, IMO.
  • The only difference between TBL_HIVE_IS_HAPPY and TBL_HIVE_IS_NOT_HAPPY is the .partitionBy("ts") in the second write; it's the partitioned write that Hive can't read.