spark-user mailing list archives

From Jaonary Rabarisoa <jaon...@gmail.com>
Subject Yet another question on saving RDD into files
Date Sat, 22 Mar 2014 09:49:36 GMT
Dear all,

As a Spark newbie, I need some help understanding how saving an RDD to a
file behaves. After reading the post on saving a single file efficiently

http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-as-a-single-file-efficiently-td3014.html

I understand that each partition of the RDD is saved into a separate file,
is that correct? And in order to get one single file, one should call
coalesce(1, shuffle = true), right?
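
To make the single-file case concrete, here is a minimal sketch of what I understand. The object name SingleFileSave and the output paths are placeholders I made up for illustration, and I assume a local master and output directories that do not exist yet.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SingleFileSave {
  // Collapse the RDD into one partition before saving, so that the output
  // directory should contain a single part file (part-00000).
  def saveAsSingleFile(rdd: RDD[String], path: String): Unit =
    rdd.coalesce(1, shuffle = true).saveAsTextFile(path)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with 2 threads, just for experimenting.
    val conf = new SparkConf().setAppName("single-file-save").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(1 to 100, 4).map(_.toString)

      // 4 partitions: I expect one part file per partition.
      lines.saveAsTextFile("/tmp/out-many")     // hypothetical path

      // 1 partition after coalesce: I expect a single part file.
      saveAsSingleFile(lines, "/tmp/out-single") // hypothetical path
    } finally {
      sc.stop()
    }
  }
}
```

Note that in both cases the path given to saveAsTextFile is a directory that holds the part files, not the file itself.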

The other use case that I have is appending an RDD to an existing file. Is
that possible with Spark? More precisely, I have a map transformation whose
results vary over time, like a big time series.
I need to store the results for further analysis, but if I store the RDD in
a different file each time I run the computation, I may end up with many
small files. Pseudocode for my process is as follows:

at every timestamp do
    RDD[Array[Double]].map -> RDD[(timestamp, Double)].save to the same file
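
For reference, here is a sketch of the only workaround I can think of so far. As far as I know saveAsTextFile has no append mode, so instead of appending, each run writes its own directory under a common base and everything is read back later with a glob. The names TimeSeriesStore, batchPath, saveBatch and loadAll, the batch-<timestamp> layout, and the use of v.sum as a stand-in for my real map function are all assumptions made up for this example.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object TimeSeriesStore {
  // Hypothetical layout: one output directory per timestamp under `base`.
  def batchPath(base: String, timestamp: Long): String =
    s"$base/batch-$timestamp"

  // Map each array to a (timestamp, value) record and save it as text.
  // v.sum is only a placeholder for the real per-array computation.
  def saveBatch(base: String, timestamp: Long, values: RDD[Array[Double]]): Unit =
    values
      .map(v => s"$timestamp,${v.sum}")
      .coalesce(1, shuffle = true) // one part file per run
      .saveAsTextFile(batchPath(base, timestamp))

  // Read every batch back as one RDD by globbing over the batch directories.
  def loadAll(sc: SparkContext, base: String): RDD[(Long, Double)] =
    sc.textFile(s"$base/batch-*").map { line =>
      val Array(ts, value) = line.split(",")
      (ts.toLong, value.toDouble)
    }
}
```

This still leaves one small file per run, which is exactly what I would like to avoid, so it does not really answer my question.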

What would be the best solution for that?

Best
