spark-user mailing list archives

From Pedro Rodriguez <ski.rodrig...@gmail.com>
Subject Re: Skew data
Date Sat, 18 Jun 2016 04:32:49 GMT
My guess is that this means the partitions within your RDD are not balanced
(one or more partitions are much larger than the rest). A single core then
has to do much more work than the others, and since a stage only finishes
when its slowest task does, the result is poor performance. In general, the
way to fix this is to spread the data evenly across partitions. In most
cases calling repartition is enough to solve the problem. If you have a
special case, you might need to create your own custom partitioner.
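
To make the idea concrete, here is a minimal sketch in plain Python (not
Spark itself) that simulates how records get assigned to partitions. The
data set, the partition count, and all function names are hypothetical and
chosen only for illustration. It assumes integer keys that hash to their own
value, and it picks keys that are all multiples of the partition count, so
partitioning by key hash sends every record to the same partition.

```python
NUM_PARTITIONS = 4

# Hypothetical keys: every key is a multiple of the partition count.
keys = [i * NUM_PARTITIONS for i in range(1000)]


def partition_sizes(keys, num_partitions, partitioner):
    """Return how many records the given partitioner puts in each partition."""
    sizes = [0] * num_partitions
    for position, key in enumerate(keys):
        sizes[partitioner(position, key) % num_partitions] += 1
    return sizes


def by_key_hash(position, key):
    # Assumption: an integer key hashes to its own value.
    return key


def round_robin(position, key):
    # Ignores the key and deals records out in turn, which is roughly
    # the effect of redistributing records evenly.
    return position


def custom(position, key):
    # Hypothetical custom partitioner that knows the keys are multiples
    # of NUM_PARTITIONS and divides that factor out before taking the modulo.
    return key // NUM_PARTITIONS


hash_sizes = partition_sizes(keys, NUM_PARTITIONS, by_key_hash)
even_sizes = partition_sizes(keys, NUM_PARTITIONS, round_robin)
custom_sizes = partition_sizes(keys, NUM_PARTITIONS, custom)

print(hash_sizes)    # [1000, 0, 0, 0]
print(even_sizes)    # [250, 250, 250, 250]
print(custom_sizes)  # [250, 250, 250, 250]
```

In Spark the corresponding calls would be rdd.repartition(n) for the even
redistribution and, for a pair RDD, partitionBy with your own partitioner.
To check whether your partitions are balanced in the first place, something
like rdd.glom().map(len).collect() in PySpark shows the size of each one.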

Pedro

On Thu, Jun 16, 2016 at 6:55 PM, Selvam Raman <selmna@gmail.com> wrote:

> Hi,
>
> What is skewed data?
>
> I read that if the data is skewed during a join, the job takes a long time
> to finish (99 percent of the tasks finish in seconds, while the remaining
> 1 percent take minutes to hours).
>
> How do I handle skewed data in Spark?
>
> Thanks,
> Selvam R
> +91-97877-87724
>



-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience
