spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Koert Kuipers <ko...@tresata.com>
Subject Re: spark on yarn succeeds but exit code 1 in logs
Date Sat, 31 Jan 2015 20:51:48 GMT
it is CDH 5.3 with the spark that ships with it.

i went through the RM logs line by line, and i found the exit code in there:

container_1422728945460_0001_01_000029 Container Transitioned from NEW to
RESERVED
2015-01-31 18:30:49,633 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode:
Reserved container container_1422728945460_0001_01_000029 on node host:
ip-10-176-1-83.ec2.internal:40695 #containers=2 available=1024 used=3072
for application
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt@7c96c19c
2015-01-31 18:30:50,638 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
Releasing reservation that cannot be satisfied for application
appattempt_1422728945460_0001_000001 on node host:
ip-10-176-1-83.ec2.internal:40695 #containers=2 available=1024 used=3072
2015-01-31 18:30:50,638 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Application application_1422728945460_0001 unreserved  on node host:
ip-10-176-1-83.ec2.internal:40695 #containers=2 available=1024 used=3072,
currently has 0 at priority 1; currentReservation <memory:0, vCores:0>
2015-01-31 18:30:50,638 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Making reservation: node=ip-10-176-1-83.ec2.internal
app_id=application_1422728945460_0001
2015-01-31 18:30:50,639 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1422728945460_0001_01_000030 Container Transitioned from NEW to
RESERVED
2015-01-31 18:30:50,639 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode:
Reserved container container_1422728945460_0001_01_000030 on node host:
ip-10-176-1-83.ec2.internal:40695 #containers=2 available=1024 used=3072
for application
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt@7c96c19c
2015-01-31 18:30:50,724 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
Updating application attempt appattempt_1422728945460_0001_000001 with
final state: FINISHING, and exit status: -1000
2015-01-31 18:30:50,729 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1422728945460_0001_000001 State change from RUNNING to
FINAL_SAVING


it seems to have something to do with reservations that cannot be
satisfied. i am running both my job and SparkPi without any specific
resource demands, and SparkPi runs fine, so not sure why that is. but at
least now i know where to dig further. thanks!

On Sat, Jan 31, 2015 at 3:28 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> Can you look inside RM log to see if you can find some clue there ?
>
> You can pastebin part of the RM log around the time your job ran ?
>
> What hadoop version are you using ?
>
> Thanks
>
> On Sat, Jan 31, 2015 at 11:24 AM, Koert Kuipers <koert@tresata.com> wrote:
>
>> i have a simple spark app that i run with spark-submit on yarn. it runs
>> fine and shows up with finalStatus=SUCCEEDED in the resource manager logs.
>>
>> however in the nodemanager logs i see this:
>>
>> 2015-01-31 18:30:48,195 INFO
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>> Memory usage of ProcessTree 18613 for container-id conta\
>> iner_1422728945460_0001_01_000001: 146.7 MB of 1 GB physical memory used;
>> 1.1 GB of 2.1 GB virtual memory used
>> 2015-01-31 18:30:51,166 WARN
>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit
>> code from container container_1422728945460_0001_01_000002 is : 1
>> 2015-01-31 18:30:51,168 WARN
>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor:
>> Exception from container-launch with container ID:
>> container_1422728945460_000\
>> 1_01_000002 and exit code: 1
>> ExitCodeException exitCode=1:
>>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>>         at org.apache.hadoop.util.Shell.run(Shell.java:455)
>>         at
>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
>>         at
>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197)
>>         at
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
>>         at
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>         at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>         at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>         at java.lang.Thread.run(Thread.java:724)
>>
>> not much extra info besides a status code here.
>> i can find the logs for container_1422728945460_0001_01_000002, but they
>> seem perfectly fine. this is the tail of them:
>>
>> 15/01/31 18:30:49 INFO spark.MapOutputTrackerWorker: Doing the fetch;
>> tracker actor = Actor[akka.tcp://sparkDriver@ip-10-176-1-83.ec2.internal
>> :41307/user/MapOutputTracker#-840\
>> 495197]
>> 15/01/31 18:30:49 INFO spark.MapOutputTrackerWorker: Got the output
>> locations
>> 15/01/31 18:30:49 INFO storage.ShuffleBlockFetcherIterator: Getting 1
>> non-empty blocks out of 1 blocks
>> 15/01/31 18:30:49 INFO storage.ShuffleBlockFetcherIterator: Started 0
>> remote fetches in 9 ms
>> 15/01/31 18:30:49 INFO Configuration.deprecation: mapred.work.output.dir
>> is deprecated. Instead, use mapreduce.task.output.dir
>> 15/01/31 18:30:49 INFO util.Hadoop18TapUtil: setting up task:
>> 'attempt_201501311830_0001_m_000000_1' -
>> hdfs://ip-10-176-1-83.ec2.internal:8020/user/koert/out4/_temporary/_atte\
>> mpt_201501311830_0001_m_000000_1
>> 15/01/31 18:30:49 INFO io.TapOutputCollector: creating path: part-00000
>> 15/01/31 18:30:49 INFO io.TapOutputCollector: closing tap collector for:
>> out4/part-00000
>> 15/01/31 18:30:49 INFO util.Hadoop18TapUtil: committing task:
>> 'attempt_201501311830_0001_m_000000_1' -
>> hdfs://ip-10-176-1-83.ec2.internal:8020/user/koert/out4/_temporary/_atte\
>> mpt_201501311830_0001_m_000000_1
>> 15/01/31 18:30:50 INFO util.Hadoop18TapUtil: saved output of task
>> 'attempt_201501311830_0001_m_000000_1' to
>> hdfs://ip-10-176-1-83.ec2.internal:8020/user/koert/out4
>> 15/01/31 18:30:50 INFO executor.Executor: Finished task 0.0 in stage 1.0
>> (TID 1). 800 bytes result sent to driver
>> 15/01/31 18:30:50 ERROR executor.CoarseGrainedExecutorBackend: Driver
>> Disassociated [akka.tcp://sparkExecutor@ip-10-176-1-83.ec2.internal:36521]
>> -> [akka.tcp://sparkDriver@ip-\
>> 10-176-1-83.ec2.internal:41307] disassociated! Shutting down.
>>
>> so i am not sure why i am seeing that exit status. i ignored it so far,
>> except that it seems to have a weird side effect: after this job runs no
>> other jobs launch anymore on the same node, as if it got blacklisted or
>> something like that. and since this is a single node pseudo-distributed
>> cluster i use for testing nothing runs anymore. a restart of yarn fixes
>> that.
>>
>> i do not see the same problem with say SparkPi, so it seems to be
>> something in my app, but i do not know where to look.
>> any ideas? thanks! koert
>>
>
>

Mime
View raw message