I'm running Spark 1.6.0 on YARN on a Hadoop 2.6.0 cluster, and I'm observing a very strange issue.
I run a simple job that reads about 1 TB of JSON logs from a remote HDFS cluster, converts them to Parquet, and saves the result to the local HDFS of the Hadoop cluster.
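For context, the job is essentially the following (a minimal sketch using the Spark 1.6 SQLContext API; the HDFS paths are hypothetical placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("json-to-parquet")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Read the JSON logs from the remote HDFS cluster
val logs = sqlContext.read.json("hdfs://remote-nn:8020/logs/")

// Write them out as Parquet to the local HDFS cluster
logs.write.parquet("hdfs://local-nn:8020/logs-parquet/")
```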
I run it with 25 executors, each with sufficient resources. The strange thing is that only 2 of the executors do most of the read work.
For example, when I go to the Executors tab in the Spark UI and look at the "Input" column, the difference between the nodes is huge, sometimes 20 GB vs. 120 GB.
Any ideas on how to achieve more balanced performance?