spark-user mailing list archives

From Ashish Rangole <arang...@gmail.com>
Subject Re: data storage formats
Date Mon, 09 Dec 2013 17:48:57 GMT
You can compress a csv or tab delimited file as well :)

You can specify the codec of your choice, say snappy, when writing out.
That's what we do. You can also write out data as sequence files. RCFile
should also be possible given the flexibility of the Spark API, but we
haven't tried that.
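
To illustrate the first point outside of Spark, here is a minimal sketch that
uses only Python's standard library to write a tab-delimited file through gzip
and read it back. The sample rows, column names, and file name are made up for
illustration, and gzip stands in for snappy only because snappy is not in the
standard library. This is not the code we run.

```python
import csv
import gzip
import os
import tempfile

# Hypothetical per-user metrics; the column names are illustrative only.
ROWS = [("user", "visits"), ("alice", "3"), ("bob", "5")]


def write_tsv_gz(path, rows):
    """Write rows as tab-delimited text, gzip-compressed on the fly."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as fh:
        csv.writer(fh, delimiter="\t").writerows(rows)


def read_tsv_gz(path):
    """Read a gzip-compressed tab-delimited file back into a list of tuples."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as fh:
        return [tuple(row) for row in csv.reader(fh, delimiter="\t")]


def roundtrip(rows):
    """Write rows to a temporary .tsv.gz file and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "metrics.tsv.gz")  # assumed file name
        write_tsv_gz(path, rows)
        return read_tsv_gz(path)


if __name__ == "__main__":
    # The data survives the compress/decompress cycle unchanged.
    print(roundtrip(ROWS) == ROWS)
```

The file stays plain delimited text underneath, so anything that can
decompress gzip can still read it line by line.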
On Dec 7, 2013 2:02 AM, "Ankur Chauhan" <achauhan@brightcove.com> wrote:

> Hi all,
>
> I am wondering what people use as the on-disk storage format. I have
> seen almost all the examples use csv files to store and load data, but that
> seems too simplistic for obvious reasons (compressibility, to name one). I
> was just interested to find out what people use to store computation
> results. For example, consider that you did some computation on some log
> files and want to store all sorts of metrics for each and every user so
> that you can later use Shark to query it interactively. What is the
> preferred or good format to store all the data? Parquet? RCFiles? csv? JSON?
>
> -- Ankur
