spark-user mailing list archives

From Ramkumar Chokkalingam <>
Subject Re: Output to a single directory with multiple files rather multiple directories ?
Date Fri, 11 Oct 2013 19:17:04 GMT
Thanks for the recommendation, Mark.

I have Hadoop set up and have been using HDFS to run my MR jobs, so I
assume it won't take much time to start using it from Spark. I can write
scripts to move the files to HDFS before running my Spark code.
Since you suggested I don't need to call parallelize() on any object,
should I go with the following approach?

1. Read each input file from HDFS
2. output = parse the file
3. Write the output to an HDFS file using the Hadoop API
4. Repeat the process for all input files

Is this the pipeline I should be following, given that my input files
are ~4 MB each and I process (parse) one file at a time? Where/how does the
parallelization (of my parsing) happen?
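For what it's worth, here is a minimal sketch of that pipeline in Spark's Scala API. The HDFS paths and the parse function are placeholders, not anything from the thread; the point is that reading a whole input directory yields one RDD whose partitions span all the files, so the parsing is parallelized per partition without any explicit parallelize() call:

```scala
import org.apache.spark.SparkContext

object ParsePipeline {
  // Placeholder parse function -- substitute the real parsing logic here.
  def parse(line: String): String = line.toUpperCase

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "parse-pipeline")

    // textFile over a directory/glob creates one RDD covering every file;
    // Spark schedules one task per partition, so map(parse) runs in parallel.
    sc.textFile("hdfs://namenode:9000/input/*")
      .map(parse)
      .saveAsTextFile("hdfs://namenode:9000/output")

    sc.stop()
  }
}
```

Note that saveAsTextFile writes a single output directory containing part-NNNNN files (one per partition), which is the "single directory with multiple files" layout asked about in the subject line.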
