I am going to take a guess that this means the partitions within an RDD are not balanced (one or more partitions are much larger than the rest). That would mean a single core has to do much more work than the others, leading to poor performance. In general, the fix is to spread the data evenly across partitions. In most cases calling repartition is enough to solve the problem. If you have a special case you might need to create your own custom partitioner.
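Not part of the original reply, but a common remedy for a join skewed by one hot key is "salting": append a small random suffix to the hot key so its records hash to several partitions instead of one (the other join side then has to replicate the hot key once per salt value). A minimal plain-Python sketch, without Spark, of why this balances partitions; `partition_of` is a stand-in for Spark's hash partitioner:

```python
import hashlib
import random
from collections import Counter

def partition_of(key, num_partitions):
    # Mimic hash partitioning: the same key always lands in the same partition.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_partitions

# A skewed dataset: one "hot" key dominates the record count.
records = ["hot"] * 9000 + ["key%d" % i for i in range(1000)]
num_partitions = 8

# Without salting, every "hot" record goes to the same partition.
plain = Counter(partition_of(k, num_partitions) for k in records)

# With salting, each "hot" record gets a suffix drawn from a small range,
# so its records spread over up to num_partitions distinct salted keys.
random.seed(0)
salted = Counter(
    partition_of(
        "%s#%d" % (k, random.randrange(num_partitions)) if k == "hot" else k,
        num_partitions,
    )
    for k in records
)

print("largest partition without salting:", max(plain.values()))
print("largest partition with salting:   ", max(salted.values()))
```

The biggest partition shrinks from roughly the full count of hot-key records to a fraction of it, which is exactly the imbalance the straggler task was suffering from.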


On Thu, Jun 16, 2016 at 6:55 PM, Selvam Raman <selmna@gmail.com> wrote:


What is skewed data?

I read that if the data is skewed while joining, the job takes a long time to finish (99 percent of tasks finish in seconds while the remaining 1 percent take minutes to hours).

How do you handle skewed data in Spark?

Selvam R

Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni