spark-user mailing list archives

From Hao Ren
Subject Big performance difference when joining 3 tables in different order
Date Thu, 04 Jun 2015 14:10:59 GMT

I encountered a performance issue when joining 3 tables in Spark SQL.

Here is the query:

SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
FROM t_category c, t_zipcode z, click_meter_site_grouped g
WHERE c.refCategoryID = g.category AND z.regionCode = g.region

I have to pay close attention to the table order in the FROM clause. Otherwise:
some orders break the driver,
some orders make the job extremely slow,
and only one order lets the job finish quickly.
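
For comparison, the same query can be written with explicit JOIN ... ON clauses, starting from click_meter_site_grouped so that every join step carries its own condition. In the comma-separated form, t_category and t_zipcode share no predicate, so a planner that combines tables in the listed order may pair those two without any join condition first. This is only a sketch: whether it avoids the slow plans on a given Spark version is an assumption and has not been verified here.

```sql
-- Same projection and join predicates as the original query;
-- only the FROM clause is restructured so each join has an ON condition.
SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
FROM click_meter_site_grouped g
JOIN t_category c ON c.refCategoryID = g.category
JOIN t_zipcode z ON z.regionCode = g.region
```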

For the slow case, I noticed that one table is loaded 56 times from its CSV file!

I would like to know more about how joins are implemented in Spark SQL
(auto broadcast, etc.) in order to understand the issue.

For those who want more details, here is the JIRA:

Any help is welcome. =) Thx

