spark-user mailing list archives

From Matei Zaharia <>
Subject Re: Controlling the name of the output file
Date Thu, 10 Oct 2013 21:32:48 GMT
Hi Ramkumar,

I don't think there's a good way to give them different names other than opening and writing
the files yourself. You could do that with a foreach(). For example, suppose you created an
RDD of records (say (key, listOfValues)) and you wanted to save each one to a different file
based on the key. You could do

records.foreach { rec =>
  val out = // open a FileOutputStream for rec's key
  // write the values to out
}

You can access HDFS directly through the FileSystem class in Hadoop: just use
FileSystem.get(uri, configuration) to get the FileSystem object.
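To make the idea above concrete, here is a minimal, hedged sketch of the per-key write loop. It uses java.io against the local filesystem purely for illustration; on a real cluster you would open streams through the Hadoop FileSystem object instead, as described above. The records collection, the "out" directory, and the file-naming scheme are all hypothetical.

```scala
import java.io.{File, PrintWriter}

object PerKeyWriter {
  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in for an RDD of (key, listOfValues) records.
    val records = Seq(
      ("apples", List("1", "2")),
      ("pears", List("3"))
    )
    val dir = new File("out")
    dir.mkdirs()
    // One output file per key, named after the key.
    records.foreach { case (key, values) =>
      val out = new PrintWriter(new File(dir, s"$key.txt"))
      try values.foreach(out.println)
      finally out.close()
    }
  }
}
```

In a Spark job the same loop body would run inside records.foreach (or foreachPartition, to amortize opening the FileSystem connection), with each executor writing the files for the records it holds.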

Otherwise, if that doesn't work, you can also rename the part-X files through the same FileSystem
API above.
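The rename approach can be sketched as follows. This version uses java.nio on the local filesystem for illustration; against HDFS you would call rename(src, dst) on the FileSystem object obtained from FileSystem.get(uri, configuration). The "job-output" directory and the result-NNNNN naming are hypothetical, and the first few lines just fabricate the part files a saveAsTextFile call would have left behind.

```scala
import java.nio.file.{Files, Paths}

object RenameParts {
  def main(args: Array[String]): Unit = {
    val outDir = Paths.get("job-output")
    Files.createDirectories(outDir)
    // Stand-ins for the files saveAsTextFile would have produced.
    Seq("part-00000", "part-00001").foreach { p =>
      Files.write(outDir.resolve(p), "data\n".getBytes)
    }
    // Rename each part-XXXXX file to a friendlier name,
    // keeping the numeric suffix so names stay unique.
    Files.list(outDir)
      .filter(p => p.getFileName.toString.startsWith("part-"))
      .forEach { p =>
        val idx = p.getFileName.toString.stripPrefix("part-")
        Files.move(p, outDir.resolve(s"result-$idx.txt"))
      }
  }
}
```

Note that the rename happens after the job finishes, on the driver (or any client with filesystem access), so it does not interfere with Spark's own output-committing.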


On Oct 10, 2013, at 4:18 AM, Ramkumar Chokkalingam <> wrote:

> Hello,
> I'm reading multiple files, parsing them, and writing to an output file. As I see it,
> saveAsTextFile takes the output path and emits the output under the directory we specify
> as files named part-00000, part-00001, etc., depending on the number of partitions used
> (similar to Hadoop). But is there a way to make all your input files be emitted into a
> single output folder? Also, do we have control over the output file name (a different
> name rather than part-00000)?
