Spark on YARN: Container exited with a non-zero exit code 143


I am using HDP 2.5 and running spark-submit in YARN cluster mode.

I am trying to generate data using a DataFrame cross join, i.e.:

val generatedData = df1.join(df2).join(df3).join(df4)
generatedData.saveAsTable(...)....

df1's storage level is MEMORY_AND_DISK.

df2, df3 and df4 use storage level MEMORY_ONLY.

df1 has far more records, around 5 million, while df2 to df4 each have at most 100 records. With this setup, the explain plan shows a BroadcastNestedLoopJoin, which should give better performance.
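
As an aside, here is a minimal sketch of how the small DataFrames could be broadcast explicitly (assuming Spark 2.x and the DataFrame names from the question); Spark may already pick BroadcastNestedLoopJoin on its own, the hint just makes the intent explicit:

import org.apache.spark.sql.functions.broadcast

// df2..df4 have at most 100 rows each, so broadcast them to every executor
// and keep only df1 partitioned; the cross join then needs no shuffle of df1.
val generatedData = df1
  .join(broadcast(df2))
  .join(broadcast(df3))
  .join(broadcast(df4))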

For some reason it always fails. I don't know how to debug it or where the memory blows up.

Error log output:

16/12/06 19:44:08 WARN YarnAllocator: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal

16/12/06 19:44:08 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal

16/12/06 19:44:08 ERROR YarnClusterScheduler: Lost executor 1 on hdp4: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal

16/12/06 19:44:08 WARN TaskSetManager: Lost task 1.0 in stage 12.0 (TID 19, hdp4): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal

I didn't see any WARN or ERROR logs before this error. What is the problem, and where should I look for the memory consumption? I cannot see anything in the Storage tab of the Spark UI. The log was taken from the YARN ResourceManager UI on HDP 2.5.

EDIT: Looking at the container log, it seems to be a java.lang.OutOfMemoryError: GC overhead limit exceeded

I know how to increase the memory, but I don't have any memory left. How can I do a cartesian product join of 4 DataFrames without getting this error?

I also met this problem and tried to solve it by referring to some blog posts.

  1. Run Spark with the configuration below:

--conf 'spark.driver.extraJavaOptions=-XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps' \
--conf 'spark.executor.extraJavaOptions=-XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC  ' \
  2. When the JVM performs a GC, you will see a message like the following:
Heap after GC invocations=157 (full 98):
 PSYoungGen      total 940544K, used 853456K [0x0000000781800000, 0x00000007c0000000, 0x00000007c0000000)
  eden space 860160K, 99% used [0x0000000781800000,0x00000007b5974118,0x00000007b6000000)
  from space 80384K, 0% used [0x00000007b6000000,0x00000007b6000000,0x00000007bae80000)
  to   space 77824K, 0% used [0x00000007bb400000,0x00000007bb400000,0x00000007c0000000)
 ParOldGen       total 2048000K, used 2047964K [0x0000000704800000, 0x0000000781800000, 0x0000000781800000)
  object space 2048000K, 99% used [0x0000000704800000,0x00000007817f7148,0x0000000781800000)
 Metaspace       used 43044K, capacity 43310K, committed 44288K, reserved 1087488K
  class space    used 6618K, capacity 6701K, committed 6912K, reserved 1048576K  
}
  3. When both PSYoungGen and ParOldGen are at 99%, you will get java.lang.OutOfMemoryError: GC overhead limit exceeded as soon as more objects are created.

  4. Try to add more memory for your executors or your driver when more memory resources are available:

--executor-memory 10000m \
--driver-memory 10000m \

  5. In my case, the PSYoungGen region was smaller than ParOldGen, which caused many young objects to be promoted into the ParOldGen area until ParOldGen ran out of space, so a java.lang.OutOfMemoryError: Java heap space error appeared.

  6. Add this configuration for the executors:

'spark.executor.extraJavaOptions=-XX:NewRatio=1 -XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps '

-XX:NewRatio=rate sets rate = ParOldGen / PSYoungGen; for example, -XX:NewRatio=1 gives the young and old generations roughly equal shares of the heap.

The right value depends on your workload. You can also try a different GC strategy, e.g.:

-XX:+UseSerialGC        : Serial Collector
-XX:+UseParallelGC      : Parallel Collector
-XX:+UseParallelOldGC   : Parallel Old Collector
-XX:+UseConcMarkSweepGC : Concurrent Mark Sweep Collector

Java Concurrent and Parallel GC
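
As a rough sketch of how one of these collectors could be wired in (the collector and flags below are only an illustration, not a recommendation for this workload), the executor JVM options can be set on the session configuration before the executors are launched, although in practice they are usually passed with --conf on spark-submit as in the snippets above:

import org.apache.spark.sql.SparkSession

// Illustration only: use the parallel old-generation collector on executors
// and keep the GC logging flags from step 1 so the effect can be observed.
val spark = SparkSession.builder()
  .config("spark.executor.extraJavaOptions",
    "-XX:+UseParallelOldGC -XX:NewRatio=1 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
  .enableHiveSupport()
  .getOrCreate()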

  7. If you have done both step 4 and step 6 but still get the error, you should consider changing your code, for example by reducing the number of iterations in an ML model.


Log files of all containers and the AM are available via:

yarn logs -applicationId application_1480922439133_0845_02

If you just want the AM logs:

yarn logs -am -applicationId application_1480922439133_0845_02

If you want to find the containers that ran for this job:

yarn logs -applicationId application_1480922439133_0845_02|grep container_e33_1480922439133_0845_02

If you want just a single container's log:

yarn logs -containerId container_e33_1480922439133_0845_02_000002

For these commands to work, log aggregation must be enabled (yarn.log-aggregation-enable=true in yarn-site.xml); otherwise you will have to fetch the logs from the individual NodeManager log directories.

Update: There is nothing you can do apart from trying swap space, but that will degrade performance a lot.

The GC overhead limit means the GC has been running non-stop in quick succession but was not able to recover much memory. The only reasons for that are either that the code is poorly written and holds a lot of back references (doubtful, since you are doing a simple join), or that the memory capacity has been reached.


REASON 1


By default the number of shuffle partitions is 200. Having too many shuffle partitions increases the complexity and the chances of the program crashing. Try controlling the number of shuffle partitions in the Spark session. I changed the count to 5 using the code below.

implicit val sparkSession = org.apache.spark.sql.SparkSession.builder().enableHiveSupport().getOrCreate()    
sparkSession.sql("set spark.sql.shuffle.partitions=5")
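
The same setting can also be applied when the session is built, rather than via SQL afterwards (a sketch, assuming Spark 2.x; 5 is just the value used in this example):

implicit val sparkSession = org.apache.spark.sql.SparkSession.builder()
  .config("spark.sql.shuffle.partitions", "5")  // equivalent to the SET statement above
  .enableHiveSupport()
  .getOrCreate()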

Additionally, if you are using DataFrames and you do not repartition them, the execution may be done by a single executor. If only one executor is running for some time, YARN will shut the other executors down. Later, if more memory is required, YARN tries to recall the other executors, but sometimes they won't come back up, so the process can fail with a memory overflow. To avoid this, try repartitioning the DataFrame before an action is called.

val df = df_temp.repartition(5)

Note that you might need to adjust the shuffle partition and repartition counts according to your requirements. In my case the above combination worked.

REASON 2


It can also occur because memory is not freed in time. For example, suppose you are running a Spark job in Scala that executes a bunch of SQL statements and exports the results to CSV. The data in some Hive tables can be very large, and you have to manage the memory in your code.

For example, consider the code below, where lst_Sqls is a list containing a set of SQL commands:

lst_Sqls.foreach(sqlCmd => spark.sql(sqlCmd).coalesce(1).write.format("com.databricks.spark.csv").option("delimiter","|").save("s3 path..."))

When you run this, you will sometimes end up seeing the same error. This is because, although Spark clears the memory, it does so lazily: your loop keeps going while Spark may only clear the memory at some later point.

In such cases, you need to manage the memory in your code, i.e. clear the memory after each command is executed. For this, let us change the code a little. I have commented what each line does below.

 import org.apache.spark.storage.StorageLevel

 lst_Sqls.foreach(sqlCmd =>
 {
      val df = spark.sql(sqlCmd)
      // Cache the result in memory; if memory is full, spill to disk
      df.persist(StorageLevel.MEMORY_AND_DISK)
      // Export the DataFrame to CSV
      df.coalesce(1).write.format("com.databricks.spark.csv").save("s3 path")
      // Release the cached data. Only after the memory is cleared does the loop move on
      df.unpersist(blocking = true)
 })


Comments
  • If the sizes of the dataframes are as you suggest (5e6, 100, 100, 100), the cartesian product will have roughly 5e12 records, i.e. 5 trillion. You haven't mentioned the number of columns, but even with a single integer column this will require terabytes of storage. If you have more than one column, the joined dataset could require hundreds or thousands of terabytes. Is this really what you want?
  • 1 column. It's a data generator utility that ran into the memory explosion.
  • Thanks for the help, I have figured out what the problem is. If you know how to solve it I would really appreciate that (I am updating the question).