Spark submit to YARN as another user
Is it possible to submit a Spark job to a YARN cluster and choose, either on the command line or inside the jar, which user will "own" the job?
The spark-submit will be launched from a script containing the user.
PS: is it still possible if the cluster has a Kerberos configuration (and the script a keytab)?
For a non-kerberized cluster:
export HADOOP_USER_NAME=zorro before submitting the Spark job will do the trick.
Make sure to unset HADOOP_USER_NAME afterwards if you want to revert to your default credentials in the rest of the shell script (or in your interactive shell session).
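A minimal sketch of such a wrapper script (the user zorro and the spark-submit arguments are placeholders; the spark-submit line is commented out so the identity handling can be seen on its own):

```shell
#!/bin/sh
# Run the job as "zorro", then restore the caller's default Hadoop identity.
# "zorro" and the spark-submit arguments are placeholders.
export HADOOP_USER_NAME=zorro
echo "submitting with HADOOP_USER_NAME=$HADOOP_USER_NAME"
# spark-submit --master yarn --deploy-mode cluster app.jar
unset HADOOP_USER_NAME
echo "HADOOP_USER_NAME is now: ${HADOOP_USER_NAME:-<default credentials>}"
```

Because the variable is only exported for the duration of the submit and then unset, any later Hadoop commands in the same script fall back to the caller's own credentials.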
For a kerberized cluster, the clean way to impersonate another account without trashing your other jobs/sessions (which probably depend on your default ticket) would be something along these lines:
export KRB5CCNAME=FILE:/tmp/krb5cc_$(id -u)_temp_$$
kinit -kt ~/.protectedDir/zorro.keytab zorro@MY.REALM
spark-submit ...........
kdestroy
Another (much safer) approach is to use proxy authentication - basically you create a service account and then allow it to impersonate other users.
$ spark-submit --help 2>&1 | grep proxy
  --proxy-user NAME    User to impersonate when submitting the application.
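A hedged sketch of such a submission (the class, jar, and user names are invented; it assumes svc_spark_prd already holds a valid Kerberos ticket, e.g. obtained via kinit -kt):

```shell
# svc_spark_prd submits on behalf of end user "alice".
# Note: --proxy-user only works if the Hadoop proxyuser rules allow it,
# and it cannot be combined with --principal/--keytab on the same submit.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --proxy-user alice \
  --class com.example.MyApp \
  my-app.jar
```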
Assuming Kerberized / secured cluster.
I mentioned it's much safer because you don't need to store (and manage) keytabs of all the users you would have to impersonate.
To enable impersonation, there are several settings you need to configure on the Hadoop side to declare which account(s) can impersonate which users or groups, and from which hosts. Let's say you have created a
svc_spark_prd service account/user.
hadoop.proxyuser.svc_spark_prd.hosts - list of fully-qualified domain names for servers which are allowed to submit impersonated Spark applications.
* is allowed but not recommended for any host.
Also specify either
hadoop.proxyuser.svc_spark_prd.users or
hadoop.proxyuser.svc_spark_prd.groups to list the users or groups that
svc_spark_prd is allowed to impersonate.
* is allowed but not recommended for any user/group.
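For illustration, the corresponding entries in core-site.xml might look like this (host names, user names, and group name are placeholders; the affected Hadoop services need a restart or a proxyuser refresh for the rules to take effect):

```xml
<!-- core-site.xml: allow svc_spark_prd to impersonate selected users -->
<property>
  <name>hadoop.proxyuser.svc_spark_prd.hosts</name>
  <value>edge1.example.com,edge2.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.svc_spark_prd.users</name>
  <value>alice,bob</value>
</property>
<property>
  <name>hadoop.proxyuser.svc_spark_prd.groups</name>
  <value>analysts</value>
</property>
```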
Also, check out documentation on proxy authentication.
Apache Livy for example uses this approach to submit Spark jobs on behalf of other end users.
If your user exists, you can still launch your spark-submit with su $my_user -c "spark-submit [...]"
I am not sure about the Kerberos keytab, but if you run kinit as this user it should be fine.
If you can't use su because you don't want to enter the password, see this Stack Overflow answer: how to run a script as another user without a password.
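As an illustrative sketch of the password-less variant, sudo with a targeted sudoers rule can run spark-submit as the other user (the user name, jar, and sudoers line are assumptions, not something from the original answer):

```shell
# Assumed sudoers entry (edited via visudo), letting the calling user run
# spark-submit as my_user without a password prompt:
#   caller_user ALL=(my_user) NOPASSWD: /usr/bin/spark-submit
sudo -u my_user spark-submit --master yarn --deploy-mode cluster app.jar
```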