spark-user mailing list archives

From Hemant Bhanawat <hemant9...@gmail.com>
Subject Re: SPARK-13900 - Join with simple OR conditions take too long
Date Thu, 31 Mar 2016 10:37:14 GMT
Hi Ashok,

That's interesting.

As I understand it, a nested loop join (which will produce m X n rows) is
performed on tables A and B, and then each row is evaluated to see if any of
the conditions is met. You are asking that Spark should instead do a
BroadcastHashJoin on the equality conditions in parallel and then union the
results, as you are doing in a different query.
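
To make the two strategies concrete, here is a minimal sketch in plain Python
(not Spark code, and not how Spark implements either plan). Rows are assumed to
be dicts, and the function names and the key_pairs parameter are made up for
illustration. It ignores SQL NULL semantics and assumes neither table contains
duplicate rows that must be kept distinct.

```python
from collections import defaultdict


def nested_loop_or_join(a_rows, b_rows, key_pairs):
    """Visit all m x n row pairs; keep a pair if ANY equality condition holds."""
    return [
        (a, b)
        for a in a_rows
        for b in b_rows
        if any(a[ka] == b[kb] for ka, kb in key_pairs)
    ]


def union_of_hash_joins(a_rows, b_rows, key_pairs):
    """Run one hash join per equality condition, then union without duplicates."""
    matched = set()  # (index in a_rows, index in b_rows)
    for ka, kb in key_pairs:
        # Build side: index the B rows by this condition's key.
        index = defaultdict(list)
        for j, b in enumerate(b_rows):
            index[b[kb]].append(j)
        # Probe side: look up each A row's key in the index.
        for i, a in enumerate(a_rows):
            for j in index.get(a[ka], []):
                matched.add((i, j))
    # Sorting by (i, j) gives the same order as the nested loop above.
    return [(a_rows[i], b_rows[j]) for i, j in sorted(matched)]
```

Both functions return the same pairs; the second one only touches rows that
share a key, but it repeats the build and probe work once per condition.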

If we leave aside parallelism for a moment, then theoretically the time taken
for the nested loop join would vary little as the number of conditions
increases, while the time taken for the solution that you are suggesting
would increase linearly with the number of conditions. So, when there are
too many conditions, the nested loop join would be faster than the solution
that you suggest. Now the question is, how should Spark decide when to do
what?
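
As a rough illustration of that trade-off, here is a back-of-the-envelope cost
model in Python. It is only a sketch under strong assumptions: it counts row
visits, ignores output size, the cost of the union, and parallelism, and the
function names are hypothetical. It is not Spark's cost model.

```python
def nested_loop_cost(m, n, k):
    """Row pairs visited by the nested loop join; independent of k in this model."""
    return m * n


def union_hash_join_cost(m, n, k):
    """Rows touched by k hash joins: each one builds on n rows and probes m rows."""
    return k * (m + n)


def crossover_conditions(m, n):
    """Smallest k at which k hash joins cost at least as much as the nested loop."""
    return -(-(m * n) // (m + n))  # ceiling division on integers
```

For example, with m = n = 1000 this model puts the crossover at 500 conditions,
so under these assumptions "too many" would be a fairly large number.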


Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
www.snappydata.io

On Thu, Mar 31, 2016 at 2:28 PM, ashokkumar rajendran <
ashokkumar.rajendran@gmail.com> wrote:

> Hi,
>
> I have filed ticket SPARK-13900. There was an initial reply from a
> developer, but I have not had any further reply on this. How can we do
> multiple hash joins together for joins based on OR conditions? Could
> someone please advise on how we can fix this?
>
> Regards
> Ashok
>
