spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: Controlling the name of the output file
Date Thu, 10 Oct 2013 21:32:48 GMT
Hi Ramkumar,

I don't think there's a good way to give them different names other than opening and writing
the files yourself. You could do that with a foreach(). For example, suppose you created an
RDD of records (say (key, listOfValues)) and you wanted to save each one to a different file
based on the key. You could do

records.foreach { case (key, values) =>
  val out = new java.io.FileOutputStream("/output/" + key)  // hypothetical output path, one file per key
  values.foreach(v => out.write((v.toString + "\n").getBytes))  // write values to out
  out.close()
}

You can access HDFS directly through the FileSystem class in Hadoop: http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/fs/FileSystem.html.
Just use FileSystem.get(uri, configuration) to get the FileSystem object.
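For example, a rough sketch of the same loop writing straight to HDFS (the namenode URI and the /output directory are placeholders you'd replace with your own):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

records.foreach { case (key, values) =>
  // get a handle to HDFS (hypothetical namenode address)
  val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())
  // create one file per key under a hypothetical /output directory
  val out = fs.create(new Path("/output/" + key))
  values.foreach(v => out.writeBytes(v.toString + "\n"))
  out.close()
}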

If that doesn't work, you can also rename the part-XXXXX files through the same FileSystem
API above.
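
Untested, but the renaming would look something like this (here /results stands in for whatever path you passed to saveAsTextFile, and the myoutput- prefix is just an example):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())
// list the part-NNNNN files that saveAsTextFile left under /results
val parts = fs.listStatus(new Path("/results")).map(_.getPath)
  .filter(_.getName.startsWith("part-"))
// rename each one in place; fs.rename returns false if the target already exists
parts.zipWithIndex.foreach { case (path, i) =>
  fs.rename(path, new Path("/results/myoutput-" + i + ".txt"))
}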

Matei

On Oct 10, 2013, at 4:18 AM, Ramkumar Chokkalingam <ramkumar.au@gmail.com> wrote:

> Hello, 
> 
> I'm reading multiple files, parsing them, and writing to an output file. As I see it,
> saveAsTextFile takes the output path and emits the output under the directory we specify,
> as files named part-00000, part-00001, etc., depending on the number of partitions used
> (similar to Hadoop). But is there a way to have all the input files emitted into a single
> output folder? Also, do we have control over the output file name (a different name rather
> than the part-00000s)?

