spark-user mailing list archives

From bkapukaranov <>
Subject Spark work distribution among execs
Date Tue, 15 Mar 2016 13:27:30 GMT

I'm running Spark 1.6.0 on YARN on a Hadoop 2.6.0 cluster.
I observe a very strange issue.
I run a simple job that reads about 1TB of JSON logs from a remote HDFS
cluster, converts them to Parquet, and saves them to the local HDFS of
the Hadoop cluster.

I run it with 25 executors with sufficient resources. However, the strange
thing is that the job uses only 2 executors to do most of the read work.

For example, when I go to the Executors tab in the Spark UI and look at the
"Input" column, the difference between the nodes is huge, sometimes 20G vs

1. What is the cause of this behaviour?
2. Any ideas on how to balance the work more evenly across the executors?

