spark-user mailing list archives

From Ramkumar Chokkalingam <ramkumar...@gmail.com>
Subject Re: Output to a single directory with multiple files rather multiple directories ?
Date Fri, 11 Oct 2013 05:10:47 GMT
Thanks to both of you for your time. To make it clear before I start off:

From my input folder:

1. Read all the filenames into a local collection.
2. Call sc.parallelize() on that collection to get a Spark RDD, say
   inputFilesRDD [which would split my input filenames among the cluster
   nodes].
3. outputRDD = inputFilesRDD.map(filename => { read the file [from local
   disk?] and parse })
4. Write the output (outputRDD) to Hadoop DFS using the Hadoop API.

So, in this pipeline my input is read from my local disk, and only at the
write step do I output to the Hadoop FileSystem, as multiple files?
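To make it concrete, here is a minimal sketch of what I have in mind, in
Scala. The paths, the parse() helper, and the master URL are all placeholders
of mine, not anything from this thread:

import org.apache.spark.SparkContext

object FileParsePipeline {
  // hypothetical per-file parser; stands in for my real parsing step
  def parse(contents: String): String = contents.toUpperCase

  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "FileParsePipeline")

    // 1. list the filenames in the (placeholder) local input folder
    val inputFilenames =
      new java.io.File("/path/to/input").listFiles.map(_.getAbsolutePath).toSeq

    // 2. parallelize the filenames so they are split among the worker nodes
    val inputFilesRDD = sc.parallelize(inputFilenames)

    // 3. map each filename to its parsed contents (map rather than foreach,
    //    since foreach returns Unit); reading local paths inside a task only
    //    works if every worker node can see the same files
    val outputRDD = inputFilesRDD.map { filename =>
      val source = scala.io.Source.fromFile(filename)
      try parse(source.mkString) finally source.close()
    }

    // 4. write to HDFS: one output directory, one part-NNNNN file per partition
    outputRDD.saveAsTextFile("hdfs://namenode:9000/path/to/output")
  }
}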

I found some Hadoop APIs under
JavaSparkContext<http://spark.incubator.apache.org/docs/0.6.1/api/core/spark/api/java/JavaSparkContext.html>
and a dedicated Hadoop RDD class,
NewHadoopRDD<http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.NewHadoopRDD>.
Is this what you were referring to?
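For instance, something like this (reusing sc from the sketch above; the
paths are again placeholders, and the map is a stand-in for real parsing):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// the convenient route: textFile wraps the Hadoop text input format
val lines = sc.textFile("hdfs://namenode:9000/path/to/input")

// the explicit route, which is backed by NewHadoopRDD underneath
val records = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
  "hdfs://namenode:9000/path/to/input")

// stand-in parsing, then a single output directory with multiple part files
lines.map(_.toUpperCase).saveAsTextFile("hdfs://namenode:9000/path/to/output")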
