AWS EMR using spark steps in cluster mode. Application application_ finished with failed status

I'm trying to launch a cluster using the AWS CLI. I use the following command:

aws emr create-cluster --name "Config1" --release-label emr-5.0.0 --applications Name=Spark --use-default-role --log-uri 's3://aws-logs-813591802533-us-west-2/elasticmapreduce/' --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium

The cluster is created successfully. Then I add this command:

aws emr add-steps --cluster-id ID_CLUSTER --region us-west-2 --steps Name=SparkSubmit,Jar="command-runner.jar",Args=[spark-submit,--deploy-mode,cluster,--master,yarn,--executor-memory,1G,--class,Traccia2014,s3://tracceale/params/scalaProgram.jar,s3://tracceale/params/configS3.txt,30,300,2,"s3a://tracceale/Tempi1"],ActionOnFailure=CONTINUE

After some time, the step failed. This is the log file:

 17/02/22 11:00:07 INFO RMProxy: Connecting to ResourceManager at ip-172-31-31-190.us-west-2.compute.internal/172.31.31.190:8032
 17/02/22 11:00:08 INFO Client: Requesting a new application from cluster with 2 NodeManagers
 17/02/22 11:00:08 INFO Client: Verifying our application has not requested ...
 Exception in thread "main" org.apache.spark.SparkException: Application application_1487760984275_0001 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1132)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1175)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 17/02/22 11:01:02 INFO ShutdownHookManager: Shutdown hook called
 17/02/22 11:01:02 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-27baeaa9-8b3a-4ae6-97d0-abc1d3762c86
 Command exiting with ret '1'

Locally (on a Hortonworks HDP 2.5 Sandbox) I run:

./spark-submit --class Traccia2014 --master local[*] --executor-memory 2G /usr/hdp/current/spark2-client/ScalaProjects/ScripRapportoBatch2.1/target/scala-2.11/traccia-22-ottobre_2.11-1.0.jar "/home/tracce/configHDFS.txt" 30 300 3

and everything works fine. I've already read something related to my problem, but I can't figure it out.

UPDATE

Checking the Application Master logs, I found this error:

17/02/22 15:29:54 ERROR ApplicationMaster: User class threw exception: java.io.FileNotFoundException: s3:/tracceale/params/configS3.txt (No such file or directory)

at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromFile(Source.scala:54)
at Traccia2014$.main(Rapporto.scala:40)
at Traccia2014.main(Rapporto.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
 17/02/22 15:29:55 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.io.FileNotFoundException: s3:/tracceale/params/configS3.txt (No such file or directory))

I pass the S3 path "s3://tracceale/params/configS3.txt" to the 'fromFile' function like this:

for(line <- scala.io.Source.fromFile(logFile).getLines())

How could I solve it? Thanks in advance.

Because you are using cluster deploy mode, the logs you have included are not useful at all. They just say that the application failed but not why it failed. To figure out why it failed, you at least need to look at the Application Master logs, since that is where the Spark driver runs in cluster deploy mode, and it will probably give a better hint as to why the application failed.

Since you have configured your cluster with a --log-uri, you will find the logs for the Application Master underneath s3://aws-logs-813591802533-us-west-2/elasticmapreduce/<CLUSTER ID>/containers/<YARN Application ID>/ where the YARN Application ID is (based on the logs you included above) application_1487760984275_0001, and the container ID should be something like container_1487760984275_0001_01_000001. (The first container for an application is the Application Master.)

What you have there is a URL to an object store, reachable from the Hadoop filesystem APIs, and a stack trace coming from java.io.File, which can't read it because it doesn't refer to anything in the local disk.
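The single slash in the error above ("s3:/tracceale/...") is the giveaway. A small pure-Scala sketch of what happens when an S3 URL is handed to java.io.File (the object name is just for illustration):

```scala
import java.io.File

object S3PathDemo {
  def main(args: Array[String]): Unit = {
    // java.io.File treats the S3 URL as a local path and normalizes it,
    // collapsing the double slash -- which is exactly why the stack
    // trace reports "s3:/tracceale/..." with a single slash.
    val f = new File("s3://tracceale/params/configS3.txt")
    println(f.getPath)  // prints: s3:/tracceale/params/configS3.txt
    println(f.exists()) // prints: false -- no such file on the local disk
  }
}
```

Since `scala.io.Source.fromFile` wraps exactly this `File`/`FileInputStream` machinery, it can only ever see the local filesystem, never an object store.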

Use SparkContext.hadoopRDD() to convert the path into an RDD.
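A minimal sketch of that fix, assuming it runs inside the Spark driver on the EMR cluster where `sc` is a live SparkContext (`sc.textFile()` is the convenience wrapper over the same Hadoop filesystem layer that `hadoopRDD()` uses):

```scala
// Sketch only: assumes a configured SparkContext `sc` on EMR,
// where the s3:// filesystem connector is already set up.
val configPath = "s3://tracceale/params/configS3.txt"

// textFile() resolves the s3:// scheme through the Hadoop FileSystem
// API, so it is readable from any node -- unlike
// scala.io.Source.fromFile, which only looks at the local disk.
val lines: Seq[String] = sc.textFile(configPath).collect().toSeq

for (line <- lines) {
  // ... same per-line processing as the original fromFile loop ...
}
```

Since the config file is small, `collect()` brings it back to the driver and the rest of the program can stay unchanged.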

The file may be missing from the expected location. You might be able to see it after SSHing into the EMR cluster, but the step command still can't find it by itself and throws that file-not-found exception.

In this scenario, here is what I did:

Step 1: Checked that the file existed in the project directory we copied to EMR.

For example, mine was in `//usr/local/project_folder/`.

Step 2: Copied the script I expected to run on EMR.

For example, I copied `//usr/local/project_folder/script_name.sh` to `/home/hadoop/`.

Step 3: Executed the script from `/home/hadoop/` by passing its absolute path to command-runner.jar:

command-runner.jar bash /home/hadoop/script_name.sh

After that, my script ran. Hope this helps someone.

Comments
  • Thank you so much, now I understand the problem. I have updated my answer, please check it out.
  • OK, I'm in. I'm trying a new strategy: I put the file onto the master node with the 'put' command. It uploads the file to /home/hadoop/, but I think the file is unreachable from the slaves; in fact, I get the same error.
  • You don't need to do that; S3 will work as a source. Just use the hadoopRDD() function to tell Spark the file is coming from a Hadoop-compatible filesystem.