spark-dev mailing list archives

From Jianshi Huang <jianshi.hu...@gmail.com>
Subject Re: Auto BroadcastJoin optimization failed in latest Spark
Date Fri, 05 Dec 2014 07:30:08 GMT
Sorry for the late follow-up.

I ran Hao's DESC EXTENDED command and found a clue:

New build (broadcast join broken):
parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892,
COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}

Old build (broadcast join working):
parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591,
totalSize=56166}

It looks like the table size computation failed in the latest version.
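To see why a zero totalSize matters: Spark SQL of this era estimated a Hive table's size from the metastore's totalSize (then rawDataSize), falling back to a very large default (spark.sql.defaultSizeInBytes) when neither was positive, which vetoes the broadcast. A rough, simplified model of that decision (not Spark's actual code; names illustrative):

```python
# Simplified sketch of the broadcast decision; not Spark source code.
AUTO_BROADCAST_THRESHOLD = 1024 * 1024 * 10  # spark.sql.autoBroadcastJoinThreshold default
DEFAULT_SIZE = 2**63 - 1                     # conservative fallback for unknown stats

def estimated_size(total_size, raw_data_size):
    # Prefer totalSize, then rawDataSize; a non-positive value is treated
    # as "unknown" and replaced by the huge default, which disables broadcast.
    if total_size > 0:
        return total_size
    if raw_data_size > 0:
        return raw_data_size
    return DEFAULT_SIZE

def broadcastable(size):
    return size <= AUTO_BROADCAST_THRESHOLD

# Old build: totalSize=56166 -> broadcast; new build: totalSize=0 -> no broadcast.
```

Under this model, the old build's totalSize=56166 is well under the 10 MB threshold, while the new build's totalSize=0 falls back to the huge default and the join is never broadcast.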

I ran the ANALYZE command:

  ANALYZE TABLE $table COMPUTE STATISTICS noscan

The tables are created from Parquet files, e.g.:
CREATE EXTERNAL TABLE table1 (
  code int,
  desc string
)
STORED AS PARQUET
LOCATION '/user/jianshuang/data/dim_tables/table1.parquet'


Does anyone know what went wrong?


Thanks,
Jianshi



On Fri, Nov 28, 2014 at 1:24 PM, Cheng, Hao <hao.cheng@intel.com> wrote:

>  Hi Jianshi,
>
> I couldn’t reproduce that with the latest master, and I always get the
> BroadcastHashJoin for managed tables (in .csv files) in my testing. Are
> there any external tables in your case?
>
>
>
> In general, there are a couple of things you can try first (with HiveContext):
>
> 1)      ANALYZE TABLE xxx COMPUTE STATISTICS NOSCAN; (apply that to all
> of the tables);
>
> 2)      SET spark.sql.autoBroadcastJoinThreshold=xxx; (set the threshold
> to a greater value; the default is 1024*1024*10 bytes. Just make sure the
> largest dimension table's size in bytes is less than this);
>
> 3)      Always put the main table (the biggest table) left-most among
> the inner joins;
>
>
>
> DESC EXTENDED tablename; -- this prints detailed table information,
> including the statistics-based table size (the field “totalSize”).
>
> EXPLAIN EXTENDED query; -- this prints the detailed physical plan.
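>
> Put together, the checklist might look like this in a HiveContext session
> (an illustrative sketch; table and column names are placeholders):
>
> ```sql
> -- Gather file-size statistics for each dimension table.
> ANALYZE TABLE dim_table1 COMPUTE STATISTICS NOSCAN;
>
> -- Raise the broadcast threshold to 64 MB (default is 1024*1024*10 bytes).
> SET spark.sql.autoBroadcastJoinThreshold=67108864;
>
> -- Check the recorded statistics (look for totalSize in the parameters).
> DESC EXTENDED dim_table1;
>
> -- Confirm the physical plan contains a BroadcastHashJoin.
> EXPLAIN EXTENDED
>   SELECT * FROM fact_table f JOIN dim_table1 d ON f.code = d.code;
> ```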
>
>
>
> Let me know if you still have problem.
>
>
>
> Hao
>
>
>
> *From:* Jianshi Huang [mailto:jianshi.huang@gmail.com]
> *Sent:* Thursday, November 27, 2014 10:24 PM
> *To:* Cheng, Hao
> *Cc:* user
> *Subject:* Re: Auto BroadcastJoin optimization failed in latest Spark
>
>
>
> Hi Hao,
>
>
>
> I'm using inner joins, as broadcast join didn't work for left joins (thanks
> for the links to the latest improvements).
>
>
>
> And I'm using HiveContext; it worked in a previous build (10/12) when
> joining 15 dimension tables.
>
>
>
> Jianshi
>
>
>
> On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao <hao.cheng@intel.com> wrote:
>
>  Are all of your join keys the same? And I guess the join types are all
> “Left” joins; https://github.com/apache/spark/pull/3362 is probably what
> you need.
>
>
>
> Also, SparkSQL doesn't currently support multiway joins (or multiway
> broadcast joins); https://github.com/apache/spark/pull/3270 should be
> another optimization for this.
>
>
>
>
>
> *From:* Jianshi Huang [mailto:jianshi.huang@gmail.com]
> *Sent:* Wednesday, November 26, 2014 4:36 PM
> *To:* user
> *Subject:* Auto BroadcastJoin optimization failed in latest Spark
>
>
>
> Hi,
>
>
>
> I've confirmed that the latest Spark, with either Hive 0.12 or 0.13.1, fails
> to apply the auto broadcast join optimization in my query. I have a query
> that joins a huge fact table with 15 tiny dimension tables.
>
>
>
> I'm currently using an older version of Spark which was built on Oct. 12.
>
>
>
> Has anyone else run into a similar situation?
>
>
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
