Data skew is still a problem with Spark.

- If you use groupByKey, try to express your logic without it (for example with reduceByKey or aggregateByKey, which combine values map-side before the shuffle).
- If you really must use groupByKey, about all you can do is scale vertically.
- If you can, repartition with a finer-grained HashPartitioner. You will get many tasks per stage, but tasks are lightweight in Spark, so this should not add much overhead. If you use your own domain partitioner, try rewriting it to introduce a secondary key (a "salt") so that a hot key is spread across several partitions.
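The secondary-key ("salting") idea above can be sketched in plain Python, without Spark, as a two-stage aggregation. The dataset, key names, and NUM_SALTS fan-out are made up for illustration; in Spark the same pattern would be two reduceByKey passes over salted and unsalted keys.

```python
import random

# Hypothetical skewed dataset: one "hot" key dominates.
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10

NUM_SALTS = 4  # fan-out factor; an assumed value for illustration

# Stage 1: append a random salt to each key, so the hot key's records
# spread across NUM_SALTS partial-aggregation buckets instead of one.
partials = {}
for key, value in records:
    salted = (key, random.randrange(NUM_SALTS))
    partials[salted] = partials.get(salted, 0) + value

# Stage 2: strip the salt and combine the (at most NUM_SALTS) partial
# sums per original key.
totals = {}
for (key, _salt), value in partials.items():
    totals[key] = totals.get(key, 0) + value

print(totals)  # {'hot': 1000, 'cold': 10}
```

This only helps for aggregations that can be done in two stages (sums, counts, maxes); it does not help if you genuinely need all values of a key in one place.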

I hope this gives you some insight and helps.

On Fri, Aug 14, 2015 at 9:37 AM Jeff Zhang <> wrote:
Data skew? Maybe your partition key has some special value like null or an empty string.
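One cheap way to check this suggestion is to count key frequencies and look at the most common ones. The sketch below uses a made-up in-memory dataset with a null hot key; on a real RDD you might sample first (e.g. rdd.sample(False, 0.01).countByKey()) to avoid a full pass.

```python
from collections import Counter

# Hypothetical records where a null key dominates, as suspected above.
records = [(None, "x")] * 500 + [("a", "x")] * 5 + [("b", "x")] * 5

# Tally how many records each key has; the top entries reveal skew.
freq = Counter(key for key, _ in records)
print(freq.most_common(3))
```

If one key (often None or "") accounts for most of the records, that key's partition is the slow task you are seeing.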

On Fri, Aug 14, 2015 at 11:01 AM, randylu <> wrote:
  It is strange that there are always two tasks slower than the others, and the
corresponding partitions' data are larger, no matter how many partitions I use.

Executor ID   Task Time   Total Tasks   Shuffle Read Size / Records
1             16 s        1             99.5 MB / 18865432
*10           0 ms        0             413.5 MB / 311001318*
100           19 s        1             110.8 MB / 27075926
101           14 s        1             126.1 MB / 25052808

  The task time and record count for Executor 10 seem strange, and the CPUs on
that node are all 100% busy.

  Has anyone met the same problem? Thanks in advance for any answer!

Sent from the Apache Spark User List mailing list archive.


Best Regards

Jeff Zhang