Spark doesn't seem to use the same warehouse that Hive uses

I have started using Spark 2.0 in Eclipse, by making a Maven project and pulling in all the latest dependencies. I am able to run Hive queries without any problems. My concern is that Spark creates another warehouse for Hive and doesn't use the data warehouse that I want. Because of that, I'm not able to read the Hive tables I have on my server into my Spark Datasets and do any transformations. I'm only able to create and work on new tables, but I want to read my existing tables in Hive.
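
For reference, this is the kind of read that fails; a minimal sketch, assuming a SparkSession named spark with Hive support enabled, where mydb.mytable is a hypothetical stand-in for one of the existing tables on the server:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // mydb.mytable stands in for an existing Hive table on the server
    Dataset<Row> existing = spark.sql("SELECT * FROM mydb.mytable");
    existing.show(); // the existing tables are not found here, since Spark is using its own metastore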

My hive-site.xml:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
    <description>user name for connecting to MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
    <description>password for connecting to MySQL server</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/usr/local/Cellar/hive-1.1.0/apache-hive-1.1.0-bin/spark-warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>
</configuration>

In hive-site.xml, add:

  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://HOST_IP_ADDRESS:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>

restart the Hive metastore service,
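
(If the metastore is not already running as a service, one common way to start it, assuming $HIVE_HOME/bin is on the PATH:

    hive --service metastore

which listens on port 9083 by default.)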

and then do one of the following:

1) Copy hive-site.xml from the $HIVE_CONF directory to the $SPARK_CONF directory (if you have no Spark installation, see the note after option 2),

or 2) set the metastore URI programmatically:

import org.apache.spark.sql.hive.HiveContext;

// Assumes an existing JavaSparkContext sc; HiveContext is the Spark 1.x entry point (deprecated in 2.0+)
HiveContext hiveContext = new HiveContext(sc);
hiveContext.setConf("hive.metastore.uris", "thrift://HOST_IP_ADDRESS:9083");
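
A note on option 1, since a plain Maven project (as in the comments below) has no $SPARK_CONF directory: Spark also looks for hive-site.xml on the classpath, so placing it in the project's resources should work instead. A sketch of the assumed layout:

    my-spark-project/
      pom.xml
      src/main/resources/hive-site.xml   <- on the classpath, so Spark picks it up at runtime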

In earlier Spark versions, we have to use HiveContext, which is a variant of Spark SQL that integrates with data stored in Hive. Even when we do not have an existing Hive deployment, we can still enable Hive support.
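
In Spark 2.0, where SparkSession replaces HiveContext (as in the question), a minimal sketch of the same setup; the class and app names are illustrative, and HOST_IP_ADDRESS is the metastore host placeholder from above:

    import org.apache.spark.sql.SparkSession;

    public class HiveReadExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("HiveReadExample")
                    .config("hive.metastore.uris", "thrift://HOST_IP_ADDRESS:9083")
                    .enableHiveSupport() // use the Hive metastore instead of Spark's own
                    .getOrCreate();

            spark.sql("SHOW TABLES").show(); // should now list the existing Hive tables
            spark.stop();
        }
    }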

You should configure this in spark-defaults.conf:

spark.sql.warehouse.dir hdfs://MA:8020/user/hive/warehouse

From http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
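
Note that, per that page, the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0 in favor of spark.sql.warehouse.dir. A sketch of the equivalent programmatic form, reusing the placeholder URI from the config line above:

    import org.apache.spark.sql.SparkSession;

    // hdfs://MA:8020/user/hive/warehouse is the placeholder warehouse URI from above
    SparkSession spark = SparkSession.builder()
            .config("spark.sql.warehouse.dir", "hdfs://MA:8020/user/hive/warehouse")
            .enableHiveSupport()
            .getOrCreate();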

Comments
  • Spark creates another warehouse for Hive... which HDFS path does it store to? Are you using Derby DB?
  • I want it to use the same DB that normal Hive uses, so that I can access the default Hive tables.
  • Can you share your hive-site.xml?
  • I've edited the question with the hive-site.xml.
  • Hey, I am also facing the same issue. Can you please let me know how you resolved it? Thanks
  • We do not mention the username when we look for the table; it's the DB name. But the DB that exists in Hive does not exist in Spark's Hive. And I do not have the option to print the configurations as you mentioned, because I'm using SparkSession instead of HiveContext, since it is Spark 2.0.
  • Were you able to try the above?
  • Yes, I tried it out today. It displays all the configurations related to this instance of SparkSession. If the property 'hive.metastore.warehouse.dir' is the same in both the hive-site.xml that Hive uses and this embedded Spark program, it should be able to save to the same database, right? Because even after specifying the same value for that property, it doesn't seem to keep the metadata in the same location and isn't accessing the Hive tables via Spark.
  • Hmm, there must be some mismatch somewhere.
  • I suspect hive.metastore.warehouse.dir might have been overridden somewhere.
  • Which $SPARK_CONF? I'm doing this in a Maven project; we don't have any $SPARK_CONF directory. All I do is get the Spark dependencies, add them to the pom, and run.
  • Where is your Spark installed?
  • I don't need to install it if I use its dependencies.
  • Do I have to change the XML as well to use the remote Hive warehouse?