Data skew is still a problem with Spark.

- If you use groupByKey, try to express your logic without groupByKey (for example with reduceByKey or aggregateByKey, which combine values map-side before the shuffle).
- If you really need groupByKey, all you can do is scale vertically.
- If you can, repartition with a finer HashPartitioner. You will have more tasks per stage, but tasks are lightweight in Spark, so this should not introduce heavy overhead. If you use your own domain partitioner, try to rewrite it with a secondary key (salting), so a single hot key is spread across several partitions.
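The secondary-key (salting) idea above can be sketched without Spark. This is a minimal illustration in plain Python, not Spark API; the names `salted_group` and `num_salts` are made up for the example. Stage 1 mimics a shuffle keyed on (key, salt), stage 2 merges the partial groups back per original key.

```python
import random
from collections import defaultdict

def salted_group(pairs, num_salts=4):
    """Group values by key after attaching a random salt, so a single
    hot key is split into num_salts smaller groups (in Spark, these
    would land in different partitions/tasks)."""
    # Stage 1: group by the salted key (key, salt).
    partial = defaultdict(list)
    for k, v in pairs:
        partial[(k, random.randrange(num_salts))].append(v)
    # Stage 2: drop the salt and merge the partial groups per key.
    merged = defaultdict(list)
    for (k, _salt), values in partial.items():
        merged[k].extend(values)
    return dict(merged)
```

The trade-off is a second (much smaller) merge step in exchange for the heavy per-key work being spread over several tasks instead of one.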

I hope this gives you some insight and helps.

On Fri, Aug 14, 2015 at 9:37 AM Jeff Zhang <zjffdu@gmail.com> wrote:
Data skew? Maybe your partition key has some special value, like null or an empty string.
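One quick way to check for such a hot key is to sample the keys and count frequencies (in Spark you could sample the pair RDD and use countByKey). A plain-Python sketch of the idea, with `top_keys` being an illustrative name, not an API:

```python
from collections import Counter

def top_keys(keys, n=5):
    """Count key frequencies in a sample; a dominant null/empty key
    (or any other skewed value) will surface at the top."""
    return Counter(keys).most_common(n)
```

If one key accounts for a large fraction of the sample, that is the partition whose task will run long.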

On Fri, Aug 14, 2015 at 11:01 AM, randylu <randylu26@gmail.com> wrote:
  It is strange that there are always two tasks slower than the others, and the
corresponding partitions' data are larger, no matter how many partitions I use.


Executor ID   Address                   Task Time   Shuffle Read Size / Records
1             slave129.vsvs.com:56691   16 s    1    99.5 MB / 18865432
*10           slave317.vsvs.com:59281    0 ms   0   413.5 MB / 311001318*
100           slave290.vsvs.com:60241   19 s    1   110.8 MB / 27075926
101           slave323.vsvs.com:36246   14 s    1   126.1 MB / 25052808

  The task time and record count of Executor 10 seem strange, and the CPUs on that
node are all 100% busy.

  Has anyone met the same problem? Thanks in advance for any answer!




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Always-two-tasks-slower-than-others-and-then-job-fails-tp24257.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org




--
Best Regards

Jeff Zhang