spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Night Wolf <nightwolf...@gmail.com>
Subject Re: Columnar-Oriented RDDs
Date Sun, 01 Mar 2015 11:33:35 GMT
Thanks for the comments guys.

Parquet is awesome. My question with using Parquet for on disk storage -
how should I load that into memory as a spark RDD and cache it and keep it
in a columnar format?

I know I can use Spark SQL with parquet which is awesome. But as soon as I
step out of SQL we have problems as it kinda gets converted back to a row
oriented format.

@Koert - that looks really exciting. Do you have any statistics on memory
and scan performance?

On Saturday, February 14, 2015, Koert Kuipers <koert@tresata.com> wrote:

> i wrote a proof of concept to automatically store any RDD of tuples or
> case classes in columar format using arrays (and strongly typed, so you get
> the benefit of primitive arrays). see:
> https://github.com/tresata/spark-columnar
>
> On Fri, Feb 13, 2015 at 3:06 PM, Michael Armbrust <michael@databricks.com
> <javascript:_e(%7B%7D,'cvml','michael@databricks.com');>> wrote:
>
>> Shark's in-memory code was ported to Spark SQL and is used by default
>> when you run .cache on a SchemaRDD or CACHE TABLE.
>>
>> I'd also look at parquet which is more efficient and handles nested data
>> better.
>>
>> On Fri, Feb 13, 2015 at 7:36 AM, Night Wolf <nightwolfzor@gmail.com
>> <javascript:_e(%7B%7D,'cvml','nightwolfzor@gmail.com');>> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to build/use column oriented RDDs in some of my Spark code. A
>>> normal Spark RDD is stored as row oriented object if I understand
>>> correctly.
>>>
>>> I'd like to leverage some of the advantages of a columnar memory format.
>>> Shark (used to) and SparkSQL uses a columnar storage format using primitive
>>> arrays for each column.
>>>
>>> I'd be interested to know more about this approach and how I could build
>>> my own custom columnar-oriented RDD which I can use outside of Spark SQL.
>>>
>>> Could anyone give me some pointers on where to look to do something like
>>> this, either from scratch or using whats there in the SparkSQL libs or
>>> elsewhere. I know Evan Chan in a presentation made mention of building a
>>> custom RDD of column-oriented blocks of data.
>>>
>>> Cheers,
>>> ~N
>>>
>>
>>
>

Mime
View raw message