spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: Split RDD and save as separate files
Date Wed, 11 Sep 2013 06:19:00 GMT
Hi Nicholas,

Right now the best way to do this is probably to run foreach() over the grouped (key, values)
pairs and use the Hadoop FileSystem API directly to write one file per key. It has a pretty
simple API based on OutputStreams:
http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/fs/FileSystem.html. You just have
to call FileSystem.get(URI, Configuration) and then call create() on the result to get a stream
to write to. You may want to write the file to a temp location first and rename it to the final
name only after the task is successful, so that a failed task does not leave a partial file
under the final name.
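
A rough sketch of what that could look like, assuming String keys and values and keys that
are valid file names. The names SplitByKey, writeGroup, outputDir and the _tmp staging
directory are made up for illustration and are not part of Spark or Hadoop; only the
FileSystem calls (get, create, exists, delete, rename) come from the Hadoop API linked above.
It needs the Hadoop client libraries on the classpath.

```scala
import java.io.IOException
import java.net.URI
import java.util.UUID

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object SplitByKey {
  // Hypothetical helper: writes the values of one key to <outputDir>/<key>.
  // Data is staged under <outputDir>/_tmp and renamed only after the write
  // has finished, so a failed attempt leaves nothing under the final name.
  def writeGroup(outputDir: String, key: String, values: Iterable[String]): Unit = {
    // Build the Configuration inside the task; it is not serializable.
    val fs = FileSystem.get(URI.create(outputDir), new Configuration())

    // A random suffix keeps retried or speculative attempts from colliding.
    val attemptId = UUID.randomUUID().toString
    val tmp = new Path(outputDir, s"_tmp/${key}.${attemptId}")
    val dst = new Path(outputDir, key)

    val out = fs.create(tmp, true) // true = overwrite if the temp file exists
    try {
      values.foreach(v => out.write((v + "\n").getBytes("UTF-8")))
    } finally {
      out.close()
    }

    // Assumption: an earlier attempt may already have produced dst, and
    // rename onto an existing file can fail, so remove it first. The delete
    // followed by rename is not atomic.
    if (fs.exists(dst)) fs.delete(dst, false)
    if (!fs.rename(tmp, dst)) {
      throw new IOException(s"Could not rename $tmp to $dst")
    }
  }
}

// Possible use from Spark, with myRDD: RDD[(String, String)] and a
// placeholder output directory:
//
//   myRDD.groupByKey().foreach { case (key, values) =>
//     SplitByKey.writeGroup("hdfs:///path/to/output", key, values)
//   }
```

All values of a key still pass through a single task here, so a very large group is written
by one worker.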

Matei

On Sep 10, 2013, at 10:16 PM, Nicholas Pritchard <nicholas.pritchard@falkonry.com> wrote:

> Hi,
> 
> I have an RDD of (Key, Value) pairs that I would like to save to HDFS. However, rather
> than putting everything into one file, I would like to split the RDD by key and save each
> part as a separate file. The key would become the filename.
> 
> In short, I am trying to do something like this:
> myRDD.groupByKey().foreach{ case(key, values) => values.saveAsTextFile(key) } 
> 
> This obviously doesn't work since values is of type Seq[V] instead of RDD[V], but does
> anyone have any suggestions for doing this efficiently? Currently, I am repeatedly filtering
> and saving the RDD, but this seems inefficient.
> 
> Thanks,
> Nick

