spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Should I convert json into parquet?
Date Mon, 19 Oct 2015 05:32:24 GMT


Good formats are Parquet or ORC. Both work well with compression, such as Snappy, and they
are much faster than JSON. However, the table structure is up to you and depends on your use
case.
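
For illustration, a minimal sketch of such a conversion, assuming the Spark 1.5-era
SQLContext API; the HDFS paths are placeholders:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext

// Snappy-compress the Parquet output (set explicitly here;
// check your Spark version's default codec).
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

// Read the JSON event files and rewrite them as Parquet.
val events = sqlContext.read.json("hdfs:///data/events/json/")
events.write.parquet("hdfs:///data/events/parquet/")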

> On 17 Oct 2015, at 23:07, Gavin Yue <yue.yuanyuan@gmail.com> wrote:
> 
> I have JSON files which contain timestamped events. Each event is associated with a user ID.
> 
> Now I want to group by user ID, converting from
> 
> Event1 -> UserIDA;
> Event2 -> UserIDA;
> Event3 -> UserIDB;
> 
> to intermediate storage:
> UserIDA -> (Event1, Event2...) 
> UserIDB -> (Event3...) 
> 
> Then I will label positives and featurize the event vectors in many different ways, fitting each of them into a logistic regression. 
> 
> I want to save this intermediate storage permanently since it will be used many times, and new events arrive every day, so I need to update it daily. 
> 
> Right now I store the intermediate data as JSON files. Should I use Parquet instead? Or is there a better solution for this use case?
> 
> Thanks a lot !
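
For the grouping and daily-update steps, a rough sketch could look like the following. This
is not the poster's actual pipeline: the column names userId and event, the run date, and
the paths are all hypothetical, and it assumes the Spark 1.5-era RDD/DataFrame APIs.

import org.apache.spark.sql.{SQLContext, SaveMode}

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val day = "2015-10-19"  // hypothetical run date
val events = sqlContext.read.json(s"hdfs:///data/events/json/$day/")

// Collect the day's events per user: UserID -> (Event1, Event2, ...).
val byUser = events
  .map(r => (r.getAs[String]("userId"), r.getAs[String]("event")))
  .groupByKey()
  .mapValues(_.toSeq)

// Append the day's groups to the permanent intermediate store as Parquet.
// Note: a user active on several days ends up with one row per day;
// downstream readers can re-group per user if a single row is needed.
byUser.toDF("userId", "events")
  .write
  .mode(SaveMode.Append)
  .parquet("hdfs:///data/events/by_user_parquet/")

The feature pipelines can then read the whole Parquet directory once per experiment instead
of re-parsing the raw JSON each time.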

