spark-user mailing list archives

From Toby Douglass <t...@avocet.io>
Subject Re: initial basic question from new user
Date Thu, 12 Jun 2014 14:47:06 GMT
On Thu, Jun 12, 2014 at 3:15 PM, FRANK AUSTIN NOTHAFT <fnothaft@berkeley.edu> wrote:

> RE:
>
> > Given that our agg sizes will exceed memory, we expect to cache them to
> disk, so save-as-object (assuming there are no out of the ordinary
> performance issues) may solve the problem, but I was hoping to store data
> in a column-oriented format.  However, I think this in general is not
> possible - Spark can *read* Parquet, but I think it cannot write Parquet as
> a disk-based RDD format.
>
> Spark can write Parquet, via the ParquetOutputFormat which is distributed
> from Parquet. If you'd like example code for writing out to Parquet, please
> see the adamSave function in
> https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMRDDFunctions.scala,
> starting at line 62. There is a bit of setup necessary for the Parquet
> write codec, but otherwise it is fairly straightforward.
>
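For anyone following the thread, the pattern Frank points at in ADAM boils down to roughly the following sketch. This is illustrative only: the Avro-generated record class `MyRecord` and its `SCHEMA$` field are placeholders, not code from ADAM, though `ParquetOutputFormat`, `AvroParquetOutputFormat`, and `saveAsNewAPIHadoopFile` are the real APIs involved.

```scala
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import parquet.avro.AvroParquetOutputFormat
import parquet.hadoop.ParquetOutputFormat
import parquet.hadoop.metadata.CompressionCodecName

// Rough sketch: save an RDD of Avro records as Parquet.
// MyRecord is a placeholder for an Avro-generated class.
def saveAsParquet(sc: SparkContext, rdd: RDD[MyRecord], path: String) {
  val job = new Job(sc.hadoopConfiguration)

  // The "bit of setup" for the write codec that Frank mentions:
  ParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP)
  ParquetOutputFormat.setEnableDictionary(job, true)
  AvroParquetOutputFormat.setSchema(job, MyRecord.SCHEMA$)

  // Parquet's Hadoop output format consumes (Void, record) pairs.
  rdd.map(r => (null, r))
     .saveAsNewAPIHadoopFile(path,
       classOf[Void], classOf[MyRecord],
       classOf[AvroParquetOutputFormat],
       job.getConfiguration)
}
```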

Thank you, Frank.

My thought is to generate an aggregated RDD from our full data set, where
the aggregated RDD will be about 10% of the size of the full data set and
will be stored to disk in a column-oriented format, to be loaded by future
jobs.

In these future jobs, when I come to load the aggregated RDD, will Spark
load only the columns being accessed by the query?  Or will Spark load
everything, convert it into an internal representation, and then execute
the query?
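For illustration, the kind of column-pruned read I am hoping for looks roughly like this on the Parquet side, where a projection schema is handed to the input format so only the requested columns are scanned. The path and `projectionSchema` are placeholders, not anything from this thread:

```scala
import org.apache.avro.generic.IndexedRecord
import org.apache.hadoop.mapreduce.Job
import parquet.avro.AvroParquetInputFormat

// Sketch: ask Parquet to materialise only the projected columns.
// projectionSchema is a placeholder Avro schema covering just the
// columns the query touches.
val job = new Job(sc.hadoopConfiguration)
AvroParquetInputFormat.setRequestedProjection(job, projectionSchema)

val records = sc.newAPIHadoopFile(
    "hdfs:///path/to/aggregated.parquet",
    classOf[AvroParquetInputFormat],
    classOf[Void],
    classOf[IndexedRecord],
    job.getConfiguration)
  .map(_._2)
```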
