As a Spark newbie, I need some help understanding how saving an RDD to a file behaves. After reading the post on saving single files efficiently,
I understand that each partition of the RDD is saved into a separate file. Is that correct? And to get a single file, one should call coalesce(1, shuffle = true), right?
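To make sure I have this right, here is a minimal sketch of what I believe happens. The local master, the partition count, and the output paths under /tmp are only assumptions for illustration; they are not from my real job.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SingleFileSave {
  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders for this sketch.
    val conf = new SparkConf().setAppName("single-file-save").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // An RDD explicitly split into 4 partitions.
    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    // Writes a directory containing one part-NNNNN file per partition.
    // Hypothetical path; the directory must not already exist.
    rdd.saveAsTextFile("/tmp/out-many")

    // Collapses the data into one partition first, so the output
    // directory contains a single part file.
    rdd.coalesce(1, shuffle = true).saveAsTextFile("/tmp/out-single")

    sc.stop()
  }
}
```

As far as I can tell, the output path is always a directory, and the save fails if that directory already exists, which is part of why I am unsure how appending would work.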
My other use case is appending an RDD to an existing file. Is that possible with Spark? More precisely, I have a map transformation whose results vary over time, like a big time series.
I need to store the results for further analysis, but if I store the RDD in a different file each time I run the computation, I may end up with many small files. Pseudocode for my process is as follows:
for every timestamp do
    RDD[Array[Double]].map -> RDD[(timestamp, Double)].save to the same file
What would be the best solution for this?