Spark 2: how does it work when SparkSession enableHiveSupport() is invoked
stop spark session pyspark
spark-shell enable hive support
# this sparkcontext may be an existing one.
sharing spark session
My question is rather simple, but somehow I cannot find a clear answer by reading the documentation.
I have Spark2 running on a CDH 5.10 cluster. There is also Hive and a metastore.
I create a session in my Spark program as follows:
SparkSession spark = SparkSession.builder().appName("MyApp").enableHiveSupport().getOrCreate()
Suppose I have the following HiveQL query:
spark.sql("SELECT someColumn FROM someTable")
I would like to know whether:
- under the hood this query is translated into Hive MapReduce primitives, or
- the support for HiveQL is only at a syntactical level and Spark SQL will be used under the hood.
I am doing some performance evaluation and I don't know whether I should claim the time performance of queries executed with
spark.sql([hiveQL query]) refer to Spark or Hive.
Spark knows two catalogs, hive and in-memory. If you set
spark.sql.catalogImplementation is set to
hive, otherwise to
in-memory. So if you enable hive support,
spark.catalog.listTables().show() will show you all tables from the hive metastore.
But this does not mean hive is used for the query*, it just means that spark communicates with the hive-metastore, the execution engine is always spark.
*there are actually some functions like
percentile_approx which are native hive UDAF.
How to use SparkSession in Apache Spark 2.0, Spark 2: how does it work when SparkSession enableHiveSupport() is invoked - apache-spark So if you enable hive support, spark.catalog.listTables().show() will show you all tables from the hive metastore. But this does not mean hive is The developers of Spark 2.0 maintained backwards compatibility with Spark 1.x when they introduced SparkSession, so all of your existing code should still work in Spark 2.0. When you are ready to modernize your code, you should understand the relationships between the older classes and SparkSession.
pyspark.sql module, enableHiveSupport() . Once the SparkSession is instantiated, you can configure Spark's all settings val configMap:Map[String, String] = spark.conf.getAll() that contains methods that work with the metastore (i.e data catalog). orderBy(desc("percent")).show(false). Fig 2. DataFrame & Dataset output The entry point to programming Spark with the Dataset and DataFrame API. A SparkSession can be used create DataFrame, register DataFrame as tables, execute SQL over tables, cache tables, and read parquet files. To create a SparkSession, use the following builder pattern: >>> spark = SparkSession.builder \
"under the hood this query is translated into Hive MapReduce primitives, or the support for HiveQL is only at a syntactical level and Spark SQL will be used under the hood."
I use spark SQL on Hive metastore. The way I verifid whether query is translated into Map/Reduce or not is to check: a. Open Hive console and run a simple SELECT query with some filter. Now go to YARN resource manager. You will see some Map/reduce jobs getting fired as a result of query execution. b. Run spark SQL using HiveContext and execute the same SQL query. Spark SQL will capitalize on the metastore information of Hive without triggering Map/Reduce jobs. Go to Resource Manager in YARN and verify it. You will only find the spark-shell session running and there is no additional Map/Reduce job that gets fired on the cluster.
A tale of Spark Session and Spark Context - achilleus, A SparkSession can be used create DataFrame, register DataFrame as The entry point for working with structured data (rows and columns) in Spark, sqlContext.range(1, 7, 2).collect() [Row(id=1), Row(id=3), Row(id=5)] enableHiveSupport(). If Column.otherwise() is not invoked, None is returned for unmatched SparkSession in Spark 2.0 provides builtin support for Hive features including the ability to write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables. To use these features, you do not need to have an existing Hive setup.
There are three execution engines, MapReduce, Tez and Spark.
When you execute query using hive you can choose to use one of the above-mentioned engines. Usually your admins must have set one of the engines as default engine.
When you execute the query using Spark, it will use the spark engine to execute the query.
However, if you are doing performance analysis Time is not the only thing you should measure, memory and CPU should be measured as well.
Introduction to SparkSession, enableHiveSupport() .getOrCreate. The spark session builder will try to get a spark session if there is one already created or create a new one Spark session is a unified entry point of a spark application from Spark 2.0. It is one of the very first objects you create while developing a Spark SQL application. As a Spark developer, you create a SparkSession using the SparkSession.builder method (that gives you access to Builder API that you use to configure the session).
Introduction to Spark 2.0, “spark://master:7077” to run on a spark standalone cluster. appName( ) enableHiveSupport() SparkSession can be used to execute SQL queries over data, getting the result back as a DataFrame (i.e. The timeline for migration is 2 months. The principle is often called "turning the database inside out". Spark 1.x : SQLContext didn’t support creating external tables. So we need to use hivecontext for do that. [code]val hiveContext = new HiveContext(sparkContext) hiveContext.setConf("hive.metastore.warehouse.dir", "/tmp") hiveContext.createExterna
SparkSession, But from Spark 2.0, Dataset will become the new abstraction layer for spark. Though RDD API will So in Spark 2.0, we have a new entry point for DataSet and Dataframe API's called as Spark Session. getOrCreate() enableHiveSupport on factory enables hive support which is similiar to HiveContext. SparkR (R on Spark) Overview. SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. In Spark 2.0.1, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, aggregation etc. (similar to R data frames, dplyr) but on large datasets.
Spark: Why should we use SparkSession ?, enableHiveSupport() // self-explanatory, isn't it? You can have as many SparkSessions as you want in a single Spark scala> spark.range(start = 0, end = 4, step = 2, numPartitions = 5).show +---+ spark-sql is the main SQL environment in Spark to work with pure SQL statements (where you do not have to use Scala to SQLContext (sparkContext, sparkSession=None, jsqlContext=None) [source] ¶ The entry point for working with structured data (rows and columns) in Spark, in Spark 1.x. As of Spark 2.0, this is replaced by SparkSession. However, we are keeping the class here for backward compatibility.
- In v2.4.2 I used
spark.catalog.listTables().show()as the show() did not exist for list object.
- @SaulCruz My solution works in 2.4.2 too.
listTables()returns a Dataset which has the
- interesting I'm not sure what is missing then
- @SaulCruz maybe you are not using Scala?
- haha @RaphaelRoth that was it, i thought the post was pyspark, good catch!
- I checked it out and it is MapReduce (mr). Does this mean that Spark forwards the HiveQL query to Hive which translates it to MR jobs? So I guess I can claim that my performance results refer to HiveQL using MR as the execution engine?
- @AnthonyArrascue Spark doesn;t forward any query to hive, it used it's own engine, Spark to execute query. However, your Hive used MR to execute Query.
- This does not answer the question. The execution engines that you are talking about are specific to Hive. Hive on Spark is different from executing a query over hive via Spark.
- @philantrovert that's what I am talking about, executing hive queries from spark is going to use spark execution engine and not going to use engine set by hive.