spark-user mailing list archives

From Marco Colombo <ing.marco.colo...@gmail.com>
Subject Re: Possible to push sub-queries down into the DataSource impl?
Date Wed, 27 Jul 2016 14:04:08 GMT
Why don't you create a filtered DataFrame, register it as a temporary table,
and then use it in your query? You can also cache it, if multiple queries
need the same inner query.
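
A minimal sketch of that approach in PySpark, under some assumptions that are not in the thread: Spark 2.0+ (`createOrReplaceTempView`; on 1.x the equivalent is `registerTempTable`), and `movies` and `ratings` already registered as tables or views. The function name is hypothetical. How much of the inner query the data source itself executes still depends on the connector; this only ensures the inner result is computed once and reused.

```python
# Inner aggregate from the original question, computed once.
INNER_SQL = """
SELECT movie_id, COUNT(*) AS aggCount
FROM ratings
WHERE rating >= 4
GROUP BY movie_id
ORDER BY aggCount DESC
LIMIT 10
"""

# The outer query joins against the temporary view named "solr"
# instead of embedding the sub-query.
OUTER_SQL = """
SELECT m.title AS title, solr.aggCount AS aggCount
FROM movies m
INNER JOIN solr ON solr.movie_id = m.movie_id
ORDER BY aggCount DESC
"""


def join_with_cached_subquery(spark):
    """Hypothetical helper; `spark` is assumed to be a SparkSession
    in which `movies` and `ratings` are already registered."""
    top_rated = spark.sql(INNER_SQL)
    top_rated.cache()  # lazy: materialized on first use, reused afterwards
    top_rated.createOrReplaceTempView("solr")
    return spark.sql(OUTER_SQL)
```

Calling `join_with_cached_subquery(spark).show()` would run the join; any later query that references the `solr` view reuses the cached rows rather than recomputing the aggregate.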

On Wednesday, 27 July 2016, Timothy Potter <thelabdude@gmail.com> wrote:

> Take this simple join:
>
> SELECT m.title as title, solr.aggCount as aggCount FROM movies m INNER
> JOIN (SELECT movie_id, COUNT(*) as aggCount FROM ratings WHERE rating
> >= 4 GROUP BY movie_id ORDER BY aggCount desc LIMIT 10) as solr ON
> solr.movie_id = m.movie_id ORDER BY aggCount DESC
>
> I would like the ability to push the inner sub-query aliased as "solr"
> down into the data source engine, in this case Solr as it will
> greatly reduce the amount of data that has to be transferred from
> Solr into Spark. I would imagine this issue comes up frequently if the
> underlying engine is a JDBC data source as well ...
>
> Is this possible? Of course, my example is a bit cherry-picked so
> determining if a sub-query can be pushed down into the data source
> engine is probably not a trivial task, but I'm wondering if Spark has
> the hooks to allow me to try ;-)
>
> Cheers,
> Tim
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>

-- 
Ing. Marco Colombo
