spark-user mailing list archives

From Sonal Goyal <>
Subject Re: Join with large data set
Date Fri, 17 Oct 2014 06:06:27 GMT
Hi Ankur,

If your RDDs share common keys, you can partition both datasets with the same
custom partitioner on those keys, so that the join avoids a shuffle and
performs much better.
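A minimal sketch of this co-partitioning approach in Spark's Scala API. The
RDD contents, partition count, and names here are hypothetical; in practice the
two keyed RDDs would come from your own data sources:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CoPartitionedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("copartitioned-join"))

    // Hypothetical keyed RDDs: (key, value) pairs sharing the same key type.
    val appData = sc.parallelize(Seq(("k1", "app1"), ("k2", "app2")))
    val refData = sc.parallelize(Seq(("k1", "ref1"), ("k3", "ref3")))

    // Partition both RDDs with the SAME partitioner and persist the results.
    // Records with equal keys land in the same partition, so the subsequent
    // join is a narrow dependency: no further shuffle is needed.
    val partitioner = new HashPartitioner(200)
    val appPart = appData.partitionBy(partitioner).persist()
    val refPart = refData.partitionBy(partitioner).persist()

    val joined = appPart.join(refPart) // joins co-located partitions in place
    joined.collect().foreach(println)

    sc.stop()
  }
}
```

Persisting after `partitionBy` matters: without it, each action recomputes the
partitioning and the shuffle savings are lost.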


Best Regards,
Nube Technologies <>


On Fri, Oct 17, 2014 at 4:27 AM, Ankur Srivastava <> wrote:

> Hi,
> I have an RDD which holds my application data and is huge. I want to join it
> with reference data which is too large to fit in memory, so I do not want to
> use a broadcast variable.
> What other options do I have to perform such joins?
> I am using Cassandra as my data store, so should I just query cassandra to
> get the reference data needed?
> Also, when I join two RDDs, will it result in a full RDD scan, or will Spark
> hash-partition the two RDDs so that records with the same keys end up on the
> same node?
> Thanks
> Ankur
