spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Efficient filtering on Spark SQL dataframes with ordered keys
Date Mon, 31 Oct 2016 12:57:39 GMT
Hi Michael,

As I see you are caching the table already sorted

val keyValRDDSorted = keyValRDD.sortByKey().cache

and the next stage is you are creating multiple tempTables (different
ranges) that cache a subset of rows already cached in RDD. The data stored
in tempTable is in Hive columnar format (I assume that means ORC format)

Well that is all you can do. Bear in mind that these tempTables are
immutable and I do not know any way of dropping tempTable to free more
memory.

Depending on the size of the main table, caching the whole table may
require a lot of memory but you can see this in UI storage page.
Alternative is to use persist(StorageLevel.MEMORY_AND_DISK_SER()) with a
mix of cached and disk.

HTH







Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 31 October 2016 at 10:55, Michael David Pedersen <
michael.d.pedersen@googlemail.com> wrote:

> Hi Mich,
>
> Thank you for your quick reply!
>
> What type of table is the underlying table? Is it Hbase, Hive ORC or what?
>>
>
> It is a custom datasource, but ultimately backed by HBase.
>
>
>> By Key you mean a UNIQUE ID or something similar and then you do multiple
>> scans on the tempTable which stores data using in-memory columnar format.
>>
>
> The key is a unique ID, yes. But note that I don't actually do multiple
> scans on the same temp table: I create a new temp table for every query I
> want to run, because each query will be based on a different key range. The
> caching is at the level of the full key-value RDD.
>
> If I did instead cache the temp table, I don't see a way of exploiting key
> ordering for key range filters?
>
>
>> That is the optimisation of tempTable storage as far as I know.
>>
>
> So it seems to me that my current solution won't be using this
> optimisation, as I'm caching the RDD rather than the temp table.
>
>
>> Have you tried it using predicate push-down on the underlying table
>> itself?
>>
>
> No, because I essentially want to load the entire table into memory before
> doing any queries. At that point I have nothing to push down.
>
> Cheers,
> Michael
>

Mime
View raw message