spark-user mailing list archives

From Ramkumar Chokkalingam <>
Subject Output to a single directory with multiple files rather multiple directories ?
Date Mon, 07 Oct 2013 03:51:14 GMT

I have started experimenting with a Spark cluster. I have a parallelizable
job where I want to walk through several folders, each containing multiple
files. For each file, I process its records and write the whole result back
to an output file. The processing operation (hashing certain fields in the
data file) is the same for every file in the folder. Simply:

For a directory D:
  Read all files inside D.
  For each file F:
    For each line L in F: do some processing and write the output to a file.
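The per-line step above might be sketched as follows; this is a hypothetical example, not code from the original post, and the field indices, delimiter, and choice of SHA-1 are all assumptions standing in for "hashing certain fields":

```python
# Hypothetical sketch of the per-line processing described above:
# replace certain fields of each delimited record with a hash.
# Field indices, delimiter, and SHA-1 are assumptions, not from the post.
import hashlib

def hash_fields(line, field_indices=(1,), delimiter=","):
    """Replace the given fields of a delimited record with their SHA-1 hashes."""
    fields = line.rstrip("\n").split(delimiter)
    for i in field_indices:
        if i < len(fields):  # skip indices beyond the record's width
            fields[i] = hashlib.sha1(fields[i].encode("utf-8")).hexdigest()
    return delimiter.join(fields)
```

In Spark this would typically run as `rdd.map(hash_fields)` over each input file's RDD, one file at a time so the outputs stay separate.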

So if there are 200 files in the input directory, I would like to have 200
files in my output directory. I have learnt that the *saveAsTextFile(name)*
API creates a directory with the name we specify and writes the actual
output files inside that directory as part-00000, part-00001, etc.
(similar to Hadoop, I assume).
My question: is there a way to specify the name of the output directory and
*redirect all my saveAsTextFile(dirName) outputs into a single folder*
instead?
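One conceivable workaround (my own sketch under stated assumptions, not something confirmed by this thread) is to let saveAsTextFile write its per-file directory as usual, then move the part files up into one shared folder, renaming them after their source file. The function name and path layout below are hypothetical:

```python
# Hypothetical post-processing step: flatten a saveAsTextFile output
# directory into a single shared folder. All names here are assumptions.
import os
import shutil

def flatten_spark_output(spark_dir, target_dir, new_name):
    """Move part-* files from a saveAsTextFile directory into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    parts = sorted(f for f in os.listdir(spark_dir) if f.startswith("part-"))
    for idx, part in enumerate(parts):
        # Keep a numeric suffix only when a file was split into several parts.
        suffix = ".%d" % idx if len(parts) > 1 else ""
        shutil.move(os.path.join(spark_dir, part),
                    os.path.join(target_dir, new_name + suffix))
    shutil.rmtree(spark_dir)  # drop the now-emptied per-file directory
```

For HDFS output the same idea would need the Hadoop FileSystem API rather than `os`/`shutil`, since those only see the local filesystem.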

Let me know if there is a way of achieving this. If not, I would appreciate
hearing some workarounds. Thanks!


Ramkumar Chokkalingam,
Masters Student, University of Washington || 206-747-3515
