spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Evan Chan (JIRA)" <>
Subject [jira] [Updated] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
Date Sat, 18 Feb 2017 01:35:44 GMT


Evan Chan updated SPARK-13219:

Hi Gagan,

That is an interesting optimization but not the same one that Venu speaks of (I worked on
those optimizations).  Basically those optimizations are for where the column name in the
WHERE clause are present in both tables, and my impression is this is what this fix is for
as well.

Your case would be very useful too.  You can do it in two steps though, first do the lookup
of postal codes from location, then translate your select from address into an IN condition.

Of course it’s better if Spark does this so that the results don’t have to be passed back
through the driver.

> Pushdown predicate propagation in SparkSQL with join
> ----------------------------------------------------
>                 Key: SPARK-13219
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.4.1, 1.5.2, 1.6.0
>         Environment: Spark 1.4
> Datastax Spark connector 1.4
> Cassandra. 2.1.12
> Centos 6.6
>            Reporter: Abhinav Chawade
> When 2 or more tables are joined in SparkSQL and there is an equality clause in query
on attributes used to perform the join, it is useful to apply that clause on scans for both
table. If this is not done, one of the tables results in full scan which can reduce the query
dramatically. Consider following example with 2 tables being joined.
> {code}
> CREATE TABLE assets (
>     assetid int PRIMARY KEY,
>     address text,
>     propertyname text
> )
> CREATE TABLE tenants (
>     assetid int PRIMARY KEY,
>     name text
> )
> spark-sql> explain select from tenants t, assets a where a.assetid = t.assetid
and t.assetid='1201';
> WARN  2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
> == Physical Plan ==
> Project [name#14]
>  ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
>   Exchange (HashPartitioning 200)
>    Filter (CAST(assetid#13, DoubleType) = 1201.0)
>     HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, Some(t)),
>   Exchange (HashPartitioning 200)
>    HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), None
> Time taken: 1.354 seconds, Fetched 8 row(s)
> {code}
> The simple workaround is to add another equality condition for each table but it becomes
cumbersome. It will be helpful if the query planner could improve filter propagation.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message