spark-user mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: Is pair rdd join more efficient than regular rdd
Date Mon, 02 Feb 2015 19:27:27 GMT
Yes, it would. You can create a key (giving you a pair RDD) and then
partition by that key (say, with a HashPartitioner). Joining is then faster
because all records with the same key go to the same partition.
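
To make the reasoning concrete, here is a minimal sketch in plain Python (not the Spark API) of why hash-partitioning both sides by the same key helps: once equal keys land in the same bucket number on both sides, each pair of buckets can be joined on its own without looking at any other bucket. The function names and the sample data (`orders`, `customers`) are made up for illustration. In Spark, the partitioning step would correspond to calling `partitionBy` with the same `HashPartitioner` on both pair RDDs before the join.

```python
from collections import defaultdict


def partition_by_key(pairs, num_partitions):
    """Place each (key, value) pair in the bucket chosen by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def join_partition(left_part, right_part):
    """Join two buckets that cover the same slice of the key space."""
    index = defaultdict(list)
    for key, value in left_part:
        index[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, right_value in right_part
        for left_value in index.get(key, ())
    ]


def co_partitioned_join(left, right, num_partitions=4):
    """Inner join done bucket by bucket; no bucket needs data from another."""
    left_parts = partition_by_key(left, num_partitions)
    right_parts = partition_by_key(right, num_partitions)
    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        joined.extend(join_partition(left_part, right_part))
    return joined


# Hypothetical sample data: (customer_id, payload) pairs.
orders = [(1, "order-a"), (2, "order-b"), (1, "order-c"), (3, "order-d")]
customers = [(1, "alice"), (2, "bob"), (4, "dave")]

result = sorted(co_partitioned_join(orders, customers))
```

The bucket-by-bucket join returns the same rows as a join over the whole data set. The saving in a distributed setting comes from not having to move records between buckets at join time. If only one side is partitioned by key, the other side still has to be shuffled to match.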

Thanks
Best Regards

On Sun, Feb 1, 2015 at 5:13 PM, Sunita Arvind <sunitarvind@gmail.com> wrote:

> Hi All
>
> We are joining large tables using Spark SQL and running into shuffle
> issues. We have explored multiple options: using coalesce to reduce the
> number of partitions, tuning various parameters such as the disk buffer,
> reducing data in chunks, etc., all of which seem to help. What I would
> like to know is whether using a pair RDD instead of a regular RDD is one
> of the solutions. Will it make the join more efficient, since Spark can
> shuffle better when it knows the key? Logically speaking, I think it
> should help, but I haven't found any evidence on the internet, including
> in the Spark SQL documentation.
>
> It is a lot of effort for us to try this approach and weigh the
> performance, as we need to register the output as tables to proceed using
> them. Hence we would appreciate inputs from the community before
> proceeding.
>
>
> Regards
> Sunita Koppar
>
>
