spark-dev mailing list archives

From "Yu, Yucai" <yucai...@intel.com>
Subject RE: Spark Sql on large number of files (~500Megs each) fails after couple of hours
Date Mon, 11 Apr 2016 03:10:24 GMT
Hi Yash,

How about checking the executor (YARN container) log? Most of the time it shows more details. We are using CDH; the log is at:

[yucai@sr483 container_1457699919227_0094_01_000014]$ pwd
/mnt/DP_disk1/yucai/yarn/logs/application_1457699919227_0094/container_1457699919227_0094_01_000014
[yucai@sr483 container_1457699919227_0094_01_000014]$ ls -tlr
total 408
-rw-r--r-- 1 yucai DP 382676 Mar 13 18:04 stderr
-rw-r--r-- 1 yucai DP  22302 Mar 13 18:04 stdout

Please note that you should check the log of the first failed container.
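
If you do not have shell access to the node, the same logs can usually be pulled with the YARN CLI (a sketch, assuming log aggregation is enabled on your cluster):

yarn logs -applicationId <your application id> > app_logs.txt

Then search the output for the stderr of the first failed container.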

Thanks,
Yucai

From: Yash Sharma [mailto:yash360@gmail.com]
Sent: Monday, April 11, 2016 10:46 AM
To: dev@spark.apache.org
Subject: Spark Sql on large number of files (~500Megs each) fails after couple of hours

Hi All,
I am trying Spark SQL on a ~16 TB dataset with a large number of files (~50K). Each file is roughly
400-500 MB.

I am issuing a fairly simple Hive query on the dataset with just filters (no groupBys or joins),
and the job is very slow. It runs for 7-8 hours and processes only about 80-100 GB on a
12-node cluster.

I have experimented with different values of spark.sql.shuffle.partitions, from 20 to 4000,
but haven't seen much difference.
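
For reference, a sketch of how the value is being passed (assuming a submit-time conf; 2000 is just one of the values tried):

--conf spark.sql.shuffle.partitions=2000

(It can also be set programmatically through sqlContext.setConf.)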

From the logs, I have the YARN error attached at the end [1]. The Spark config I used for the
job is below [2].

Is there any other tuning I should look into? Any tips would be appreciated.

Thanks


2. Spark config -
spark-submit
--master yarn-client
--driver-memory 1G
--executor-memory 10G
--executor-cores 5
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.initialExecutors=2
--conf spark.dynamicAllocation.minExecutors=2


1. Yarn Error:

16/04/07 13:05:37 INFO yarn.YarnAllocator: Container marked as failed: container_1459747472046_1618_02_000003.
Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1459747472046_1618_02_000003
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
        at org.apache.hadoop.util.Shell.run(Shell.java:455)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1