spark-user mailing list archives

From Adrian Tanase <atan...@adobe.com>
Subject Re: Should I convert json into parquet?
Date Mon, 19 Oct 2015 13:47:16 GMT
For general data access of the pre-computed aggregates (group by), you’re better off with
Parquet. I’d only choose JSON if I needed interop with another app stack / language that
has difficulty accessing Parquet (e.g. bulk load into a document DB…).
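As a rough sketch of that round trip with DataFrames (the sqlContext, paths and column names below are just placeholders):

  // read raw JSON events and pre-compute a simple per-user aggregate (an event count here)
  val events = sqlContext.read.json("hdfs:///data/events/day=2015-10-19")
  val aggregates = events.groupBy("userId").count()

  // Parquet keeps the schema and compresses well; aggregates.write.json(...) is the JSON equivalent
  aggregates.write.mode("overwrite").parquet("hdfs:///data/aggregates/current")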

On a strategic level, JSON and Parquet are similar in that neither gives you good random
access, so you can’t simply “update specific user IDs as new data comes in”. Your strategy
will probably be to re-process all the users by loading the new data and the current
aggregates, joining them, and writing a new version of the aggregates…
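In practice the daily refresh ends up as a full rewrite, roughly along these lines (same placeholder paths and columns as above):

  import org.apache.spark.sql.functions.sum

  // current aggregates plus a fresh aggregate over the new day's events
  val current = sqlContext.read.parquet("hdfs:///data/aggregates/current")
  val newDay = sqlContext.read.json("hdfs:///data/events/day=2015-10-20")
    .groupBy("userId").count()

  // combine and write a *new* version -- neither Parquet nor JSON supports in-place updates
  val updated = current.unionAll(newDay)
    .groupBy("userId").agg(sum("count").as("count"))

  updated.write.mode("overwrite").parquet("hdfs:///data/aggregates/v2")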

If you’re worried about update performance, then you probably need to look at a DB that offers
random write access (Cassandra, HBase…).
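With the spark-cassandra-connector on the classpath, pushing the aggregates out would look roughly like this (the keyspace and table names are made up, and the table must already exist):

  sqlContext.read.parquet("hdfs:///data/aggregates/current")
    .write
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "analytics", "table" -> "user_aggregates"))
    .mode("append")
    .save()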

-adrian




On 10/19/15, 12:31 PM, "Ewan Leith" <ewan.leith@realitymine.com> wrote:

>As Jörn says, Parquet and ORC will get you really good compression and can be much faster.
>There are also some nice additions around predicate pushdown, which can be great if you've
>got wide tables.
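>For example, something like the query below only reads the referenced columns and can skip
>whole row groups via Parquet's min/max statistics (column names are made up):
>
>  sqlContext.read.parquet("hdfs:///data/events_parquet")
>    .filter("eventDate >= '2015-10-01'")
>    .select("userId", "eventType")
>    .count()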
>
>Parquet is obviously easier to use, since it's bundled into Spark. Using ORC is described
here http://hortonworks.com/blog/bringing-orc-support-into-apache-spark/
>
>Thanks,
>Ewan
>
>-----Original Message-----
>From: Jörn Franke [mailto:jornfranke@gmail.com] 
>Sent: 19 October 2015 06:32
>To: Gavin Yue <yue.yuanyuan@gmail.com>
>Cc: user <user@spark.apache.org>
>Subject: Re: Should I convert json into parquet?
>
>
>
>Good formats are Parquet or ORC. Both can be used with compression, such as Snappy. They are
>much faster than JSON. However, the table structure is up to you and depends on your use case.
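>For example, the Parquet codec can be set explicitly through the SQL conf (the paths below are
>just placeholders):
>
>  val events = sqlContext.read.json("hdfs:///data/events")
>  sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
>  events.write.parquet("hdfs:///data/events_parquet")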
>
>> On 17 Oct 2015, at 23:07, Gavin Yue <yue.yuanyuan@gmail.com> wrote:
>> 
>> I have JSON files which contain timestamped events. Each event is associated with a user ID.
>> 
>> Now I want to group by user ID, converting from
>> 
>> Event1 -> UserIDA;
>> Event2 -> UserIDA;
>> Event3 -> UserIDB;
>> 
>> to intermediate storage:
>> UserIDA -> (Event1, Event2...)
>> UserIDB -> (Event3...)
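>> In code the grouping step would be roughly the following (field names are simplified):
>> 
>>   // read the raw JSON events, then group the rows by user ID
>>   val df = sqlContext.read.json("hdfs:///data/events")
>>   val byUser = df.rdd
>>     .map(row => (row.getAs[String]("userId"), row))
>>     .groupByKey()                                // RDD[(String, Iterable[Row])]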
>> 
>> Then I will label positives and featurize the event vectors in many different ways, and fit
>> each of them with logistic regression.
>> 
>> I want to save this intermediate storage permanently since it will be used many times. New
>> events arrive every day, so I need to update this intermediate storage daily.
>> 
>> Right now I store the intermediate data as JSON files. Should I use Parquet instead? Or is
>> there a better solution for this use case?
>> 
>> Thanks a lot !
>> 