spark-user mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: Subquery performance
Date Fri, 18 Mar 2016 00:59:03 GMT
Try running EXPLAIN on both versions of the query.

Most likely, when you cache the subquery we know that it's going to be small, so
we use a broadcast join instead of shuffling the data.
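
To illustrate why the join strategy changes the shuffle size so much, here is a
rough sketch in plain Python. It is not Spark code and not how Spark implements
either join. The function names, the sample rows, and the "rows moved" counter
are all made up for illustration. It only assumes what the query above implies:
the subquery reduces tab2 to its distinct col2 values, and the outer query
counts the tab1 rows whose col1 matches one of them.

```python
from collections import Counter, defaultdict


def shuffle_join_count(tab1_col1, sub_keys, num_partitions=4):
    """Toy shuffle join: rows from BOTH sides are repartitioned by key hash."""
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    rows_moved = 0
    for key in tab1_col1:  # every row of the large side is moved
        left_parts[hash(key) % num_partitions].append(key)
        rows_moved += 1
    for key in sub_keys:  # every row of the small side is moved as well
        right_parts[hash(key) % num_partitions].append(key)
        rows_moved += 1
    counts = Counter()
    for p in range(num_partitions):
        matching = set(right_parts[p])
        for key in left_parts[p]:
            if key in matching:
                counts[key] += 1
    return dict(counts), rows_moved


def broadcast_join_count(tab1_col1, sub_keys):
    """Toy broadcast join: only the small side is copied; the large side stays put."""
    lookup = set(sub_keys)  # the small subquery result, sent to every task
    rows_moved = len(lookup)
    counts = Counter(key for key in tab1_col1 if key in lookup)
    return dict(counts), rows_moved


# Hypothetical sample data standing in for tab1.col1 and tab2.col2.
tab1_col1 = ["a", "a", "b", "c", "c", "c"]
tab2_col2 = ["a", "a", "c", "d"]
sub_keys = sorted(set(tab2_col2))  # what the GROUP BY col2 subquery yields

shuffled, moved_by_shuffle = shuffle_join_count(tab1_col1, sub_keys)
broadcasted, moved_by_broadcast = broadcast_join_count(tab1_col1, sub_keys)
print(shuffled, moved_by_shuffle)
print(broadcasted, moved_by_broadcast)
```

Both functions return the same counts, but the broadcast version moves only the
small side. Whether Spark actually picks a broadcast join for your query is
what the EXPLAIN output should tell you.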

On Thu, Mar 17, 2016 at 5:53 PM, Younes Naguib <
Younes.Naguib@tritondigital.com> wrote:

> Hi all,
>
>
>
> I’m running a query that looks like the following:
>
> Select col1, count(1)
> From (Select col2, count(1) from tab2 group by col2)
> Inner join tab1 on (col1=col2)
> Group by col1
>
>
>
> This creates a very large shuffle, 10 times the data size, as if the
> subquery were executed for each row.
>
> Can anything be done to help tune this?
>
> When the subquery is persisted, it runs much faster, and the shuffle is 50
> times smaller!
>
>
>
> *Thanks,*
>
> *Younes*
>
