spark-user mailing list archives

From Sadhan Sood <sadhan.s...@gmail.com>
Subject Re: does spark sql support columnar compression with encoding when caching tables
Date Sat, 20 Dec 2014 01:43:13 GMT
Thanks Michael, that makes sense.

On Fri, Dec 19, 2014 at 3:13 PM, Michael Armbrust <michael@databricks.com>
wrote:

> Yeah, Tachyon does sound like a good option here. Especially if you have
> nested data, it's likely that Parquet in Tachyon will always be better
> supported.
>
> On Fri, Dec 19, 2014 at 2:17 PM, Sadhan Sood <sadhan.sood@gmail.com>
> wrote:
>>
>> Hey Michael,
>>
>> Thank you for clarifying that. Is Tachyon the right way to get compressed
>> data in memory, or should we explore the option of adding compression to
>> cached data? I ask because our uncompressed data set is too big to fit in
>> memory right now. I see the benefit of Tachyon not just in storing
>> compressed data in memory: we also wouldn't have to create a separate table
>> for caching some partitions, like 'cache table table_cached as select * from
>> table where date = 201412XX', which is what we are doing right now.
>>
>>
>> On Thu, Dec 18, 2014 at 6:46 PM, Michael Armbrust <michael@databricks.com>
>> wrote:
>>>
>>> There is only column-level encoding (run-length encoding, delta
>>> encoding, dictionary encoding) and no generic compression.
>>>
>>> On Thu, Dec 18, 2014 at 12:07 PM, Sadhan Sood <sadhan.sood@gmail.com>
>>> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> I'm wondering whether, when caching a table backed by LZO-compressed
>>>> Parquet data, Spark also compresses it (using lzo/gzip/snappy) along with
>>>> column-level encoding, or only does the column-level encoding, when
>>>> "spark.sql.inMemoryColumnarStorage.compressed" is set to true. I ask
>>>> because when I try to cache the data, I notice the memory being used is
>>>> almost as much as the uncompressed size of the data.
>>>>
>>>> Thanks!
>>>>
>>>
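
To make the three column-level encodings named in the thread (run-length, delta, dictionary) concrete, here is a minimal sketch in plain Python. It is not Spark's implementation; the function names and the sample values are made up for illustration, and it only shows the general idea behind each technique.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def delta_encode(values):
    """Keep the first value, then store each later value as the difference
    from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def dictionary_encode(values):
    """Replace each value with an integer index into a list of distinct values."""
    dictionary = []   # distinct values in first-seen order
    index_of = {}     # value -> position in `dictionary`
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


if __name__ == "__main__":
    # Hypothetical sample columns, not taken from the thread.
    dates = [20141201, 20141201, 20141201, 20141202, 20141202]
    print(run_length_encode(dates))   # [(20141201, 3), (20141202, 2)]
    print(delta_encode(dates))        # [20141201, 0, 0, 1, 0]
    print(dictionary_encode(["us", "uk", "us", "us"]))  # (['us', 'uk'], [0, 1, 0, 0])
```

Encodings of this kind only pay off when a column has long runs, small differences between neighbours, or few distinct values. A column with mostly unique, unordered values gains little from them, which may be one reason a cached table can end up close to its uncompressed size.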
