spark-user mailing list archives

From Timothy Potter <thelabd...@gmail.com>
Subject Re: Possible to push sub-queries down into the DataSource impl?
Date Mon, 01 Aug 2016 15:45:27 GMT
yes, that's exactly what I was looking for, thanks for the pointer ;-)

On Thu, Jul 28, 2016 at 1:07 AM, Takeshi Yamamuro <linguin.m.s@gmail.com> wrote:
> Hi,
>
> Have you seen this ticket?
> https://issues.apache.org/jira/browse/SPARK-12449
>
> // maropu
>
> On Thu, Jul 28, 2016 at 2:13 AM, Timothy Potter <thelabdude@gmail.com>
> wrote:
>>
>> I'm not looking for a one-off solution for a specific query that can
>> be solved on the client side as you suggest, but rather a generic
>> solution that can be implemented within the DataSource impl itself
>> when it knows a sub-query can be pushed down into the engine. In other
>> words, I'd like to intercept the query planning process to be able to
>> push-down computation into the engine when it makes sense.
>>
>> On Wed, Jul 27, 2016 at 8:04 AM, Marco Colombo
>> <ing.marco.colombo@gmail.com> wrote:
>> > Why don't you create a filtered DataFrame, register it as a temporary
>> > table, and then use it in your query? You can also cache it if multiple
>> > queries on the same inner query are requested.
>> >
>> >
>> > On Wednesday, July 27, 2016, Timothy Potter <thelabdude@gmail.com>
>> > wrote:
>> >>
>> >> Take this simple join:
>> >>
>> >> SELECT m.title AS title, solr.aggCount AS aggCount
>> >> FROM movies m
>> >> INNER JOIN (SELECT movie_id, COUNT(*) AS aggCount
>> >>             FROM ratings
>> >>             WHERE rating >= 4
>> >>             GROUP BY movie_id
>> >>             ORDER BY aggCount DESC
>> >>             LIMIT 10) AS solr
>> >>   ON solr.movie_id = m.movie_id
>> >> ORDER BY aggCount DESC
>> >>
>> >> I would like the ability to push the inner sub-query aliased as "solr"
>> >> down into the data source engine, in this case Solr, as it will
>> >> greatly reduce the amount of data that has to be transferred from
>> >> Solr into Spark. I would imagine this issue comes up frequently if the
>> >> underlying engine is a JDBC data source as well ...
>> >>
>> >> Is this possible? Of course, my example is a bit cherry-picked so
>> >> determining if a sub-query can be pushed down into the data source
>> >> engine is probably not a trivial task, but I'm wondering if Spark has
>> >> the hooks to allow me to try ;-)
>> >>
>> >> Cheers,
>> >> Tim
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>> >>
>> >
>> >
>> > --
>> > Ing. Marco Colombo
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
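
The data-transfer argument made in the original question can be illustrated with a small sketch. This is a toy model only, not Spark's DataSource API: an in-memory sqlite3 database stands in for the remote engine (Solr in the thread), and the table contents and the helper names `make_engine`, `without_pushdown` and `with_pushdown` are assumptions made up for illustration. A `movie_id` tie-breaker is also added to the ORDER BY so that both paths return the same rows deterministically.

```python
import sqlite3

# The inner sub-query from the thread, plus a movie_id tie-breaker
# (an addition for this sketch) so the top 10 is deterministic.
SUBQUERY = (
    "SELECT movie_id, COUNT(*) AS aggCount FROM ratings "
    "WHERE rating >= 4 GROUP BY movie_id "
    "ORDER BY aggCount DESC, movie_id LIMIT 10"
)


def make_engine():
    """Build a stand-in 'remote engine' holding made-up ratings rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ratings (movie_id INTEGER, rating INTEGER)")
    rows = [
        (movie_id, rating)
        for movie_id in range(1, 51)
        for rating in range(1, 6)
        for _ in range(movie_id % 7 + 1)  # vary the number of ratings per movie
    ]
    conn.executemany("INSERT INTO ratings VALUES (?, ?)", rows)
    return conn


def without_pushdown(conn):
    """Fetch every row, then filter, aggregate, sort and limit on the client."""
    fetched = conn.execute("SELECT movie_id, rating FROM ratings").fetchall()
    counts = {}
    for movie_id, rating in fetched:
        if rating >= 4:
            counts[movie_id] = counts.get(movie_id, 0) + 1
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:10]
    return top, len(fetched)


def with_pushdown(conn):
    """Let the engine evaluate the whole sub-query; only the result is fetched."""
    fetched = conn.execute(SUBQUERY).fetchall()
    return fetched, len(fetched)


if __name__ == "__main__":
    engine = make_engine()
    client_top, client_rows = without_pushdown(engine)
    pushed_top, pushed_rows = with_pushdown(engine)
    print("same result:", client_top == pushed_top)
    print("rows transferred without push-down:", client_rows)
    print("rows transferred with push-down:", pushed_rows)
```

Both paths produce the same top-10 list, but the push-down path transfers only the rows of the final result, while the client-side path has to pull the whole `ratings` table first. Deciding automatically when a sub-query can be handed to the engine this way is the harder part that the thread is asking about.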
