spark-user mailing list archives

From Borislav Kapukaranov <b.kapukara...@gmail.com>
Subject Spark work distribution among execs
Date Tue, 15 Mar 2016 12:46:24 GMT
Hi,

I'm running Spark 1.6.0 on YARN on a Hadoop 2.6.0 cluster.
I'm observing a very strange issue.
I run a simple job that reads about 1 TB of JSON logs from a remote HDFS
cluster, converts them to Parquet, and then saves them to the local HDFS of
the Hadoop cluster.
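
For reference, the job is essentially the following (a minimal sketch using
the Spark 1.6 DataFrame API; the HDFS paths and app name below are
placeholders, not the real ones):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val conf = new SparkConf().setAppName("json-to-parquet")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Read the JSON logs from the remote HDFS cluster (placeholder path).
    val logs = sqlContext.read.json("hdfs://remote-nn:8020/logs/")

    // Convert and save as Parquet on the local HDFS of this cluster
    // (placeholder path).
    logs.write.parquet("hdfs://local-nn:8020/warehouse/logs_parquet")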

I run it with 25 executors, each with sufficient resources. However, the
strange thing is that only 2 of the executors end up doing most of the read
work.
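
I submit it roughly like this (the core/memory settings, class name and jar
name are placeholders; only the executor count of 25 is the real value):

    spark-submit --master yarn --deploy-mode cluster \
      --num-executors 25 \
      --executor-cores 4 \
      --executor-memory 8G \
      --class com.example.JsonToParquet json-to-parquet.jar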

For example, when I go to the Executors tab in the Spark UI and look at the
"Input" column, the difference between the nodes is huge, sometimes 20 GB vs.
120 GB.

Any ideas on how to achieve more balanced performance?

Thanks,
Borislav
