spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Armbrust <mich...@databricks.com>
Subject Re: Subquery performance
Date Fri, 18 Mar 2016 18:14:55 GMT
If you encode the data in something like parquet we usually have more
information and will try to broadcast.

On Thu, Mar 17, 2016 at 7:27 PM, Younes Naguib <
Younes.Naguib@tritondigital.com> wrote:

> Anyways to cache the subquery or force a broadcast join without persisting
> it?
>
>
>
> y
>
>
>
> *From:* Michael Armbrust [mailto:michael@databricks.com]
> *Sent:* March-17-16 8:59 PM
> *To:* Younes Naguib
> *Cc:* user@spark.apache.org
> *Subject:* Re: Subquery performance
>
>
>
> Try running EXPLAIN on both version of the query.
>
>
>
> Likely when you cache the subquery we know that its going to be small so
> use a broadcast join instead of a shuffling the data.
>
>
>
> On Thu, Mar 17, 2016 at 5:53 PM, Younes Naguib <
> Younes.Naguib@tritondigital.com> wrote:
>
> Hi all,
>
>
>
> I’m running a query that looks like the following:
>
> Select col1, count(1)
>
> From (Select col2, count(1) from tab2 group by col2)
>
> Inner join tab1 on (col1=col2)
>
> Group by col1
>
>
>
> This creates a very large shuffle, 10 times the data size, as if the
> subquery was executed for each row.
>
> Anything can be done to tune to help tune this?
>
> When the subquery in persisted, it runs much faster, and the shuffle is 50
> times smaller!
>
>
>
> *Thanks,*
>
> *Younes*
>
>
>

Mime
View raw message