spark-user mailing list archives

From Daniel Siegmann <>
Subject Re: Which is more efficient : first join three RDDs and then do filtering or vice versa?
Date Thu, 12 Mar 2015 18:24:57 GMT
Join causes a shuffle (sending data across the network). I expect it will
be better to filter before you join, so you reduce the amount of data which
is sent across the network.

Note this would be true for *any* transformation which causes a shuffle. It
would not be true if you're combining RDDs with union, since that doesn't
cause a shuffle.
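
To make the difference concrete, here is a rough sketch in plain Python. It is a toy model, not Spark code: `shuffle_join` and `keep` are hypothetical names, the data is made up, and the "cost" simply counts every input record of a join as one record sent across the network. It also assumes the filter can be applied to each RDD on its own (here it only looks at the key). In PySpark the corresponding calls would be `rdd.filter(...)` and `rdd.join(...)`.

```python
from collections import defaultdict

def shuffle_join(left, right):
    """Toy stand-in for a pair-RDD join.

    Assumption: every input record is shuffled once, so the cost of a
    join is simply the number of records on both sides.
    Returns (joined_records, records_shuffled).
    """
    shuffled = len(left) + len(right)
    by_key = defaultdict(list)
    for k, v in right:
        by_key[k].append(v)
    joined = [(k, (lv, rv)) for k, lv in left for rv in by_key.get(k, [])]
    return joined, shuffled

def keep(record):
    """Hypothetical filter: keep only records whose key is a multiple of 10."""
    return record[0] % 10 == 0

# Three made-up key/value datasets standing in for the three RDDs.
a = [(i, i) for i in range(1000)]
b = [(i, i * 2) for i in range(1000)]
c = [(i, i * 3) for i in range(1000)]

# Alternative 1: join all three, then filter.
ab, s1 = shuffle_join(a, b)
abc, s2 = shuffle_join(ab, c)
late = [r for r in abc if keep(r)]
late_cost = s1 + s2

# Alternative 2: filter each dataset, then join.
fa, fb, fc = ([r for r in d if keep(r)] for d in (a, b, c))
fab, t1 = shuffle_join(fa, fb)
early, t2 = shuffle_join(fab, fc)
early_cost = t1 + t2

print("same result:", sorted(late) == sorted(early))
print("records shuffled, join then filter:", late_cost)
print("records shuffled, filter then join:", early_cost)
```

Both alternatives produce the same records, but in this model joining first shuffles 4000 records while filtering first shuffles 400. The actual saving depends on how selective your filter is.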

On Thu, Mar 12, 2015 at 11:04 AM, shahab <> wrote:

> Hi,
> This question has probably been answered on the mailing list before,
> but I couldn't find it. Sorry for posting it again.
> I need to join and filter three different RDDs, and I wonder which of
> the following alternatives is more efficient:
> 1- first join all three RDDs and then filter the resulting joined
> RDD, or
> 2- apply filtering to each individual RDD and then join the resulting RDDs
> Or is there probably no difference, due to lazy evaluation and Spark's
> underlying optimisation?
> best,
> /Shahab
