spark-user mailing list archives

From Hao Ren <inv...@gmail.com>
Subject Big performance difference when joining 3 tables in different order
Date Thu, 04 Jun 2015 14:10:59 GMT
Hi,

I encountered a performance issue when joining 3 tables in Spark SQL.

Here is the query:

SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
FROM t_category c, t_zipcode z, click_meter_site_grouped g
WHERE c.refCategoryID = g.category AND z.regionCode = g.region

I have to pay close attention to the table order in the FROM clause. Depending on the order:
some orders break the driver,
some orders make the job extremely slow,
only one order lets the job finish quickly.

For the slow case, I noticed that one table is loaded 56 times (!) from its CSV
file.

I would like to know more about how joins are implemented in Spark SQL
(auto broadcast, etc.) in order to understand the issue.
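
For illustration only, here is a sketch of the same query written with explicit JOIN ... ON clauses, so that the intended join order is visible in the text, together with two Spark SQL statements related to the points above (repeated loading and auto broadcast). This is not a confirmed fix: whether the planner honors the written order, and whether auto broadcast applies to tables loaded from CSV, are assumptions to be checked. The threshold value shown is just the documented default, and each statement is assumed to be issued separately.

```sql
-- Assumption: caching the small lookup tables avoids re-reading their source files.
CACHE TABLE t_category
CACHE TABLE t_zipcode

-- Tables whose estimated size is below this many bytes are candidates for a
-- broadcast join. 10485760 (10 MB) is the documented default, shown as an example.
SET spark.sql.autoBroadcastJoinThreshold=10485760

-- Same query as above, with the large table first and each lookup table
-- joined explicitly on its key.
SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
FROM click_meter_site_grouped g
JOIN t_category c ON c.refCategoryID = g.category
JOIN t_zipcode z ON z.regionCode = g.region
```

The explicit form should return the same rows as the comma-separated FROM list, since both express inner joins on the same two conditions.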

For those who want more details, here is the JIRA:
https://issues.apache.org/jira/browse/SPARK-8102

Any help is welcome. =) Thx

Hao



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Big-performance-difference-when-joining-3-tables-in-different-order-tp23150.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.