spark-user mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: Question on Spark 1.3 SQL External Datasource
Date Wed, 18 Mar 2015 00:05:20 GMT
Hey Yang,

My comments are in-lined below.

Cheng

On 3/18/15 6:53 AM, Yang Lei wrote:
> Hello,
>
> I am migrating my Spark SQL external datasource integration from Spark 
> 1.2.x to Spark 1.3.
>
> I noticed there are a couple of new filters now, e.g. 
> org.apache.spark.sql.sources.And. However, for a SQL query with the 
> condition "A AND B", PrunedFilteredScan.buildScan still gets an 
> Array[Filter] with two filters, A and B, while I had expected to get 
> one "And" filter with left == A and right == B.
>
> So my first question is: where can I find the "rules" for 
> converting a SQL condition into the filters passed to 
> PrunedFilteredScan.buildScan?
Top-level AND predicates are always broken into smaller sub-predicates. 
The And filter that appears in the external data sources API is for nested 
predicates, like A OR (NOT (B AND C)).
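
For example (the table and column names below are made up, just to show 
the shape of the arguments), for a query like 
SELECT a FROM t WHERE a = 1 AND (b = 2 OR c = 3), buildScan would roughly 
receive:

    import org.apache.spark.sql.sources._

    val requiredColumns = Array("a")
    val filters: Array[Filter] = Array(
      EqualTo("a", 1),                      // top-level conjunct, split out
      Or(EqualTo("b", 2), EqualTo("c", 3))  // nested predicate, kept as a tree
    )

i.e. the top-level AND is split into separate array elements, while the 
nested OR arrives as a single filter tree.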
>
> I do like what I see with these And, Or, Not filters, where we allow 
> recursively nested definitions to connect filters together. If this is 
> the direction we are heading, my second question is: do we just 
> need one Filter object instead of an Array[Filter] in buildScan?
For data sources with further filter push-down abilities (e.g. Parquet), 
breaking down top-level AND predicates can be convenient.
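
As a rough sketch (not what Spark or any built-in source actually does), a 
relation can recursively compile the filter trees it understands into its 
own predicate language and simply skip the conjuncts it cannot handle, 
leaving those for Spark to evaluate:

    import org.apache.spark.sql.sources._

    // Compile a single filter tree, returning None for anything this
    // particular source cannot push down.
    def compileFilter(f: Filter): Option[String] = f match {
      case EqualTo(attr, v)     => Some(s"$attr = '$v'")
      case GreaterThan(attr, v) => Some(s"$attr > '$v'")
      case LessThan(attr, v)    => Some(s"$attr < '$v'")
      case And(l, r) =>
        for (a <- compileFilter(l); b <- compileFilter(r)) yield s"($a AND $b)"
      case Or(l, r) =>
        for (a <- compileFilter(l); b <- compileFilter(r)) yield s"($a OR $b)"
      case Not(c) => compileFilter(c).map(p => s"(NOT $p)")
      case _      => None  // unsupported filter, Spark evaluates it afterwards
    }

    // Because top-level ANDs arrive as separate array elements, the relation
    // can push down exactly the conjuncts it managed to compile.
    def pushable(filters: Array[Filter]): Array[String] =
      filters.flatMap(compileFilter)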
>
> The third question is: what is our plan to allow a relation provider 
> to inform Spark which filters have already been handled, so that there is 
> no redundant filtering?
Yeah, this is a good point. I guess we could add some method like 
"filterAccepted" to PrunedFilteredScan.
>
> Appreciate comments and links to any existing documentation or discussion.
>
>
> Yang

