spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Armbrust <mich...@databricks.com>
Subject Re: Efficient filtering on Spark SQL dataframes with ordered keys
Date Tue, 01 Nov 2016 22:45:56 GMT
registerTempTable is backed by an in-memory hash table that maps table name
(a string) to a logical query plan.  Fragments of that logical query plan
may or may not be cached (but calling register alone will not result in any
materialization of results).  In Spark 2.0 we renamed this function to
createOrReplaceTempView, since a traditional RDBMs view is a better analogy
here.

If I was trying to augment the engine to make better use of HBase's
internal ordering, I'd probably use the experimental ability to inject
extra strategies into the query planner.  Essentially, you could look for
filters on top of BaseRelations (the internal class used to map DataSources
into the query plan) where there is a range filter on some prefix of the
table's key.  When this is detected, you could return an RDD that contains
the already filtered result talking directly to HBase, which would override
the default execution pathway.

I wrote up a (toy) example of using this API
<https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/3897837307833148/2840265927289860/latest.html>,
which might be helpful.

On Tue, Nov 1, 2016 at 4:11 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> it would be great if we establish this.
>
> I know in Hive these temporary tables "CREATE TEMPRARY TABLE ..." are
> private to the session and are put in a hidden staging directory as below
>
> /user/hive/warehouse/.hive-staging_hive_2016-07-10_22-58-
> 47_319_5605745346163312826-10
>
> and removed when the session ends or table is dropped
>
> Not sure how Spark handles this.
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 1 November 2016 at 10:50, Michael David Pedersen <michael.d.pedersen@
> googlemail.com> wrote:
>
>> Thanks for the link, I hadn't come across this.
>>
>> According to https://forums.databricks.com/questions/400/what-is-the-diff
>>> erence-between-registertemptable-a.html
>>>
>>> and I quote
>>>
>>> "registerTempTable()
>>>
>>> registerTempTable() creates an in-memory table that is scoped to the
>>> cluster in which it was created. The data is stored using Hive's
>>> highly-optimized, in-memory columnar format."
>>>
>> But then the last post in the thread corrects this, saying:
>> "registerTempTable does not create a 'cached' in-memory table, but rather
>> an alias or a reference to the DataFrame. It's akin to a pointer in C/C++
>> or a reference in Java".
>>
>> So - probably need to dig into the sources to get more clarity on this.
>>
>> Cheers,
>> Michael
>>
>
>

Mime
View raw message