spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: Output to a single directory with multiple files rather multiple directories ?
Date Thu, 10 Oct 2013 21:37:03 GMT
Hey, sorry, the answer to this question is similar to the previous one: you'll have to
move the files from the output directories into a common directory by hand, possibly renaming
them. The Hadoop InputFormat and OutputFormat APIs that we use are designed to work at
the level of directories (one directory represents one dataset).
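
A rough sketch of the by-hand move, assuming the output directories sit on a local or
shared filesystem reachable through java.nio (for HDFS you would go through Hadoop's
FileSystem API instead). The object name CollectParts, the helper targetName, and the
"<directory>-<part file>" naming scheme are made up for illustration:

```scala
import java.io.File
import java.nio.file.{Files, Path, Paths, StandardCopyOption}

object CollectParts {
  // Hypothetical naming scheme: prefix each part file with the name of the
  // directory it came from, so files from different directories do not collide.
  def targetName(sourceDir: String, partFile: String): String =
    s"$sourceDir-$partFile"

  // Moves every part-* file from each output directory into commonDir and
  // returns the new paths. Other files (e.g. _SUCCESS, .crc) are left alone.
  def collect(outputDirs: Seq[String], commonDir: String): Seq[Path] = {
    val target = Paths.get(commonDir)
    Files.createDirectories(target)
    for {
      dir  <- outputDirs
      // listFiles() returns null when dir is missing or not a directory
      file <- Option(new File(dir).listFiles()).getOrElse(Array.empty[File]).toSeq
      if file.isFile && file.getName.startsWith("part-")
    } yield {
      val dest = target.resolve(targetName(new File(dir).getName, file.getName))
      Files.move(file.toPath, dest, StandardCopyOption.REPLACE_EXISTING)
    }
  }
}
```

Something like CollectParts.collect(Seq("out/fileA", "out/fileB"), "out/all") would then
leave out/all/fileA-part-00000, out/all/fileB-part-00000, and so on (the paths here are
placeholders).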

One other option may be to build a union of multiple RDDs using SparkContext.union(rdd1,
rdd2, etc) and then call saveAsTextFile on the result. That way they'll all be written to the
same output location.
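
A minimal sketch of the union approach, assuming Spark is on the classpath under the
org.apache.spark package. The object name UnionAndSave, the helper unionAndSave, and the
process parameter (standing in for the per-line hashing) are hypothetical:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object UnionAndSave {
  // Reads each input directory as an RDD of lines, applies the per-line
  // processing, unions the results, and writes them to one output directory.
  // Note: saveAsTextFile fails if outputDir already exists.
  def unionAndSave(sc: SparkContext,
                   inputDirs: Seq[String],
                   outputDir: String,
                   process: String => String): Unit = {
    val rdds: Seq[RDD[String]] =
      inputDirs.map(dir => sc.textFile(dir).map(process))
    // SparkContext.union also has a varargs form: sc.union(rdd1, rdd2, ...)
    val combined = sc.union(rdds)
    combined.saveAsTextFile(outputDir)
  }
}
```

The output directory gets one part-NNNNN file per partition of the unioned RDD, which
is not necessarily one file per input file.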

Matei

On Oct 6, 2013, at 8:51 PM, Ramkumar Chokkalingam <ramkumar.au@gmail.com> wrote:

> 
> Hello, 
> 
> I have started experimenting with a Spark cluster. I have a parallelization job where I
> want to parse through several folders, each of which has multiple files. I parse each file,
> do some processing on its records, and write the whole file back to an output file.
> I do the same processing operation (hashing certain fields in the data file) for all the files
> inside the folder. Simply put:
> 
> For a directory D, 
>   Read all files inside D. 
>     For each File F
>       Loop: For each line L in F, I do some processing and write my processing output
>       to a file.
> 
> So if there are 200 files inside the input directory, I would like to have 200 files in
> my output directory. I learned that with the saveAsTextFile(Name) API, Spark creates a directory
> with the name we specify (Name) and creates the actual output files inside that folder in
> the form of part-00000, part-00001, etc. (similar to Hadoop, I assumed).
> My question: is there a way to specify the name of the output directory and redirect
> all my saveAsTextFile(DirName) outputs into a single folder instead?
> 
> Let me know if there is a way of achieving this. If not, I would appreciate hearing some
> workarounds. Thanks!
> 
> 
> Regards,
> 
> Ramkumar Chokkalingam, 
> Masters Student, University of Washington || 206-747-3515
> 
>  
> 

