Is it possible to run Spark on a YARN cluster from code?

I have a MapReduce task which I want to run on a Spark YARN cluster from my Java code. I also want to retrieve the reduce result (a string and number pair, i.e. a tuple) in my Java code. Something like:

// Imports needed by the snippet below.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// I know setMaster("YARN") is wrong, but it describes what I want:
// I want to execute the job on the cluster.
SparkConf sparkConf = new SparkConf().setAppName("Test").setMaster("YARN");
JavaSparkContext sc = new JavaSparkContext(sparkConf);

JavaRDD<Integer> input = sc.parallelize(list);

// map
JavaPairRDD<String, Integer> results = input.mapToPair(new MapToPairExample());

// reduce: MyResultsComparator is my own Comparator over the (String, Integer) pairs
String max = results.max(new MyResultsComparator())._1();

It works if I set the master to local, local[*], or spark://master:7077.

So the question is: can I do the same with a YARN cluster somehow?

You need to do it using spark-submit. spark-submit handles many things for you, from shipping dependencies to the cluster to setting the correct classpath, etc. When you run it as a plain Java main program in local mode, your IDE takes care of the classpath (since the driver and executors run in the same JVM).

You can also use "yarn-client" mode if you want your driver program to run on your machine.

For yarn-cluster mode, use .setMaster("yarn-cluster").
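As a rough illustration (assuming a pre-2.0 Spark where "yarn-client" and "yarn-cluster" are accepted master strings; on Spark 2.x+ the master is simply "yarn" and the deploy mode is a separate setting), the configuration could look like this:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch only: "yarn-client" keeps the driver in the local JVM, which is why it can
// work from an IDE, provided HADOOP_CONF_DIR / YARN_CONF_DIR point at the cluster
// configuration. "yarn-cluster" needs the driver to run inside the cluster, which is
// why it is normally launched via spark-submit.
SparkConf conf = new SparkConf()
        .setAppName("Test")
        .setMaster("yarn-client");   // or "yarn-cluster"; just "yarn" on Spark 2.x+

JavaSparkContext sc = new JavaSparkContext(conf);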

Typically, a spark-submit command with the master set to yarn and the deploy mode set to cluster works the following way (source: the Spark code base on GitHub):

  1. spark-submit script calls Main.java
  2. Main.java calls SparkSubmit.java
  3. SparkSubmit.java calls YarnClusterApplication by figuring out the master and deploy parameters
  4. YarnClusterApplication calls Client.java
  5. Client.java talks to the Resource Manager and hands over the request to launch the ApplicationMaster.
  6. The Resource Manager instantiates ApplicationMaster.java in a container on a Node Manager.
  7. ApplicationMaster.java:
    1. allocates containers for executors using ExecutorRunnables
    2. uses the reflection API to find the main method in the user-supplied jar
    3. spawns a thread that executes the user application by invoking the main method found in the previous step; this is where your code executes (see the sketch below)

In this flow, Steps 1-5 happen on the client/gateway machine. From Step 6 onwards, everything executes on the YARN cluster.
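To make steps 7.2 and 7.3 concrete, here is a rough, hypothetical sketch (not Spark's actual code) of how an application master can locate and invoke the main method of a user-supplied class via reflection:

import java.lang.reflect.Method;

public class UserMainInvoker {
    // Illustration of steps 7.2-7.3: load the user's main class (the user jar has
    // already been localized onto the container's classpath), look up its static
    // main(String[]) via reflection, and run it on a dedicated thread.
    public static void runUserMain(String userClassName, String[] userArgs) throws Exception {
        Class<?> userClass = Class.forName(userClassName);
        Method mainMethod = userClass.getMethod("main", String[].class);

        Thread userThread = new Thread(() -> {
            try {
                // Static method, so the target is null; cast to Object so the array
                // is passed as a single argument rather than expanded as varargs.
                mainMethod.invoke(null, (Object) userArgs);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, "user-application");

        userThread.start();
        userThread.join();   // the real ApplicationMaster also waits for this thread
    }
}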

Now, to answer your question: I have never tried executing Spark in yarn-cluster mode from code, but based on the above flow, your piece of code can only run inside an ApplicationMaster container on a Node Manager machine of the YARN cluster if you want it to run in yarn-cluster mode. And your code can only get there if you specify spark-submit --master yarn --deploy-mode cluster on the command line. So specifying it in the code and:

  1. running the job, e.g. from the IDE, will fail.
  2. running the job using spark-submit --master yarn --deploy-mode cluster will execute your code in a thread inside the ApplicationMaster, which runs on a Node Manager machine in the YARN cluster; that thread will ultimately re-execute your setMaster("yarn-cluster") line, which is now redundant, but the rest of your code will run successfully.

Any corrections to this are welcome!
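
As a minimal sketch of the second case (assuming the jar is launched with spark-submit --master yarn --deploy-mode cluster, so the master does not need to be hard-coded):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class TestJob {
    public static void main(String[] args) {
        // No setMaster() here: the master and deploy mode come from spark-submit,
        // so the same jar can run locally, in yarn-client or in yarn-cluster mode.
        SparkConf sparkConf = new SparkConf().setAppName("Test");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // ... build the RDDs, mapToPair, reduce, etc., exactly as in the question ...

        sc.stop();
    }
}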

I would like to point you to some relevant classes that can help you do a spark-submit from your code to YARN.

Basically, you can create a YARN deploy client from the org.apache.spark:spark-yarn library. There is a package called org.apache.spark.deploy.yarn which has a Client class.

The tricky part is that you have to pass a SparkConf to that class, and that SparkConf must carry the Hadoop configuration of the cluster you are trying to deploy to.

For example, you can try something like this (Scala):

 import java.net.URL
 import org.apache.hadoop.conf.Configuration

 // Builds a Hadoop Configuration for the target cluster by reading the live
 // configuration exposed by its NameNode web UI (/conf on port 50070).
 def rawHadoopConf(cluster: String): Configuration = {
   val hadoopConfig = new Configuration(false)
   hadoopConfig.addResource(new URL(s"http://hadoop-$cluster.com:50070/conf").openStream())
   hadoopConfig.set("fs.defaultFS", s"hdfs://$cluster/")
   hadoopConfig
 }
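
Building on that, here is a rough Java sketch of wiring a SparkConf into the yarn Client. It assumes the Spark 2.x two-argument Client constructor; the class is internal to Spark, so the constructor and the accepted ClientArguments flags change between versions, and the --jar / --class values below are placeholders for your own application.

import org.apache.spark.SparkConf;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.deploy.yarn.ClientArguments;

public class ProgrammaticYarnSubmit {
    public static void main(String[] args) {
        // Hypothetical jar path and main class; replace with your application's.
        String[] clientArgs = new String[] {
            "--jar",   "/path/to/my-spark-app.jar",
            "--class", "com.example.MySparkJob"
        };

        SparkConf sparkConf = new SparkConf()
            .setAppName("Test")
            .setMaster("yarn");
        // The Hadoop/YARN configuration of the target cluster must be visible to this
        // client, e.g. via HADOOP_CONF_DIR / YARN_CONF_DIR or spark.hadoop.* properties
        // derived from something like rawHadoopConf() above.

        // Spark 2.x-style constructor (an assumption); run() submits the application
        // to the ResourceManager and waits for it to finish.
        Client client = new Client(new ClientArguments(clientArgs), sparkConf);
        client.run();
    }
}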

Comments
  • Use .setMaster("yarn-cluster") (or yarn-client), then launch with ./spark-submit --master yarn-cluster. You can check the status at master:8088 and then click on the Application Master of the running application.