spark-user mailing list archives

From Corey Nolet <cjno...@gmail.com>
Subject Re: Spark SQL Custom Predicate Pushdown
Date Sat, 17 Jan 2015 03:17:19 GMT
Hao,

Thanks so much for the links! This is exactly what I'm looking for. If I
understand correctly, I can extend PrunedFilteredScan, PrunedScan, and
TableScan, and I should be able to support all of the SQL semantics?
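
For concreteness, here is a minimal sketch of what such a relation might
look like, assuming the 1.3-style package layout
(org.apache.spark.sql.types); the AccumuloRelation name, its schema, and
the empty scan body are hypothetical stand-ins, not anything from the
thread:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical relation backed by an Accumulo table.
class AccumuloRelation(table: String)(@transient val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  // The schema this relation exposes to Spark SQL.
  override def schema: StructType = StructType(Seq(
    StructField("docId", StringType, nullable = false),
    StructField("field", StringType),
    StructField("value", StringType)))

  // Spark hands over only the columns the query needs plus the filters it
  // could translate; anything we can't evaluate server-side is re-checked
  // by Spark after the scan, so pushing down a subset is safe.
  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // A real implementation would turn `filters` into Accumulo ranges and
    // iterator settings, scan `table`, and emit only `requiredColumns`.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}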

I'm a little confused about the Array[Filter] that is used with the
filtered scan. I have the ability to perform pretty robust seeks in the
underlying data sets in Accumulo. I have an inverted index, and I'm able to
do intersections as well as unions, and rich predicates which form a tree
of alternating intersections and unions. If I understand correctly, the
Array[Filter] is to be treated as an AND operator? Do OR operators get
propagated through the API at all? I'm trying to do as much paring down of
the dataset as possible on the individual tablet servers so that the data
loaded into the Spark layer is minimal; it's really used to perform joins,
groupBys, sortBys, and other computations that would require the relations
to be combined in various ways.
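
To make the AND/OR question concrete, a hedged sketch of translating the
pushed-down filters into a query tree follows. The Array[Filter] is an
implicit conjunction of its elements; in Spark versions whose sources
package defines And, Or, and Not nodes (they are not in 1.2.0), a
disjunction arrives as a nested Or filter rather than being split across
the array. The QueryNode hierarchy is invented here to stand in for an
Accumulo-side intersection/union tree:

import org.apache.spark.sql.sources._

object FilterTranslator {
  // Hypothetical query-tree nodes for the Accumulo side.
  sealed trait QueryNode
  case class TermEq(field: String, value: Any) extends QueryNode
  case class Intersect(children: Seq[QueryNode]) extends QueryNode
  case class Union(children: Seq[QueryNode]) extends QueryNode

  // Returns None for anything Accumulo can't evaluate server-side.
  def translate(f: Filter): Option[QueryNode] = f match {
    case EqualTo(attr, v) => Some(TermEq(attr, v))
    case And(l, r) =>
      for (a <- translate(l); b <- translate(r)) yield Intersect(Seq(a, b))
    case Or(l, r) =>
      // An Or is only pushable if BOTH sides translate; pushing half of a
      // disjunction would silently drop rows.
      for (a <- translate(l); b <- translate(r)) yield Union(Seq(a, b))
    case _ => None
  }

  // The array itself is an implicit AND of whatever it contains.
  def translateAll(filters: Array[Filter]): Seq[QueryNode] =
    filters.flatMap(translate).toSeq
}

Dropping an untranslatable filter from the top-level conjunction is safe
because Spark re-applies the predicates to the rows the scan returns; it
costs extra I/O, never correctness.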

Thanks again for pointing me to this.



On Fri, Jan 16, 2015 at 2:07 AM, Cheng, Hao <hao.cheng@intel.com> wrote:

> The Data Source API probably works for this purpose.
>
> It supports column pruning and predicate push down:
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>
>
>
> Examples can also be found in the unit tests:
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/sources
>
> *From:* Corey Nolet [mailto:cjnolet@gmail.com]
> *Sent:* Friday, January 16, 2015 1:51 PM
> *To:* user
> *Subject:* Spark SQL Custom Predicate Pushdown
>
>
>
> I have document storage services in Accumulo that I'd like to expose to
> Spark SQL. I am able to push down predicate logic to Accumulo to have it
> perform only the seeks necessary on each tablet server to grab the results
> being asked for.
>
>
>
> I'm interested in using Spark SQL to push those predicates down to the
> tablet servers. Where would I begin my implementation? Currently I have an
> input format which accepts a "query object" that gets pushed down. How
> would I extract this information from the HiveContext/SQLContext to be able
> to push this down?
>
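
A hedged sketch of the entry point the quoted question asks about: with the
Data Source API you don't pull the query object out of the
HiveContext/SQLContext yourself. Instead you register a RelationProvider,
and Spark SQL calls back into your relation with the pruned columns and
translated filters. The DefaultSource class and the com.example.accumulo
package name below are hypothetical, and AccumuloRelation is the relation
sketched above:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider}

// Spark SQL instantiates this class when a table is declared USING the
// package that contains it.
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new AccumuloRelation(parameters("table"))(sqlContext)
}

After which the relation is queryable from SQL, e.g.:

sqlContext.sql(
  "CREATE TEMPORARY TABLE docs USING com.example.accumulo OPTIONS (table 'docs')")
sqlContext.sql("SELECT docId FROM docs WHERE field = 'author'")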
