spark-user mailing list archives

From Ramkumar Chokkalingam <>
Subject Re: Output to a single directory with multiple files rather multiple directories ?
Date Fri, 11 Oct 2013 05:10:47 GMT
Thanks to both of you for your time. To make it clear before I start off:

From my input folder:
1. Read all the filenames into a local collection.
2. Call sc.parallelize() on that collection to get a Spark RDD, say InputFilesRDD [which would split my input filenames across the cluster].
3. outputRDD = InputFilesRDD.map(filename => {read the file [from local disk?] and parse})
4. Write the output (outputRDD) to the Hadoop DFS using the Hadoop API.
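The steps above can be sketched roughly as follows. This is only an illustration, not code from this thread: the paths, the `parse` helper, and the local-mode SparkContext are all hypothetical, and each worker must be able to see the input files at the given paths (e.g. a shared mount) for step 3 to work.

```scala
import org.apache.spark.SparkContext
import scala.io.Source

object ReadLocalWriteHdfs {
  // Hypothetical parse function; stands in for whatever per-file parsing is needed.
  def parse(contents: String): String = contents

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "ReadLocalWriteHdfs")

    // 1. List the input filenames on the driver (local disk); path is an assumption.
    val fileNames: Seq[String] =
      new java.io.File("/path/to/input").listFiles.map(_.getAbsolutePath).toSeq

    // 2. Distribute the *filenames* across the cluster.
    val inputFilesRDD = sc.parallelize(fileNames)

    // 3. Read and parse each file on the workers. Note: map, not foreach --
    //    foreach returns Unit, so its results could not be saved afterwards.
    val outputRDD = inputFilesRDD.map { name =>
      val contents = Source.fromFile(name).mkString
      parse(contents)
    }

    // 4. Write to HDFS: this produces ONE output directory containing
    //    multiple part-NNNNN files, one per partition. URL is an assumption.
    outputRDD.saveAsTextFile("hdfs://namenode:9000/path/to/output")

    sc.stop()
  }
}
```

Because `saveAsTextFile` writes one `part-NNNNN` file per partition into a single directory, this shape of pipeline already gives "one directory, multiple files" rather than one directory per input file.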

So, in this pipeline, my input is read from my local disk, and only while writing do I output to the Hadoop FileSystem as multiple files.
I found some Hadoop APIs under a dedicated Hadoop API. Is this what you were referring to?
