spark-user mailing list archives

From Sameer Farooqui <same...@databricks.com>
Subject Re: how to convert an rdd to a single output file
Date Fri, 12 Dec 2014 20:22:16 GMT
Instead of doing this on the compute side, I would just write the output as
multiple part files into HDFS initially and then use "hadoop fs -getmerge"
or HDFSConcat to produce one final output file.
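At heart, getmerge concatenates every part file under the output directory, in name order, into one local file. A plain-Java sketch of that merge step against a local directory (no Hadoop dependency; the class name, part-file prefix, and paths are illustrative):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Local stand-in for "hadoop fs -getmerge": concatenate the part files
// of a job's output directory, in name order, into one destination file.
public class PartFileMerger {

    public static void merge(Path partsDir, Path dest) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(partsDir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        // Concatenate in sorted name order (part-00000, part-00001, ...)
        Collections.sort(parts);
        try (OutputStream out = Files.newOutputStream(dest)) {
            for (Path part : parts) {
                Files.copy(part, out);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("parts");
        Files.write(dir.resolve("part-00000"), "a\n".getBytes());
        Files.write(dir.resolve("part-00001"), "b\n".getBytes());
        Path merged = dir.resolve("merged.txt");
        merge(dir, merged);
        System.out.print(new String(Files.readAllBytes(merged)));
    }
}
```

On a real cluster the merge runs on the client, not the executors, so the driver's memory never holds more than a copy buffer at a time.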


- SF

On Fri, Dec 12, 2014 at 11:19 AM, Steve Lewis <lordjoe2000@gmail.com> wrote:
>
>
> I have an RDD which is potentially too large to store in memory with
> collect. I want a single task to write the contents as a file to hdfs. Time
> is not a large issue but memory is.
> I use the following to convert my RDD (scans) to a local Iterator. This
> works, but hasNext shows up as a separate task and takes on the order of
> 20 sec for a medium-sized job.
> Is toLocalIterator a bad function to call in this case, and is there a
> better one?
>
> public void writeScores(final Appendable out, JavaRDD<IScoredScan> scans) {
>     writer.appendHeader(out, getApplication());
>     Iterator<IScoredScan> scanIterator = scans.toLocalIterator();
>     while (scanIterator.hasNext()) {
>         IScoredScan scan = scanIterator.next();
>         writer.appendScan(out, getApplication(), scan);
>     }
>     writer.appendFooter(out, getApplication());
> }
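For what it's worth, the memory side of the quoted pattern is sound: toLocalIterator brings one partition at a time to the driver, so only a single partition ever needs to fit in memory, and the per-partition job that hasNext triggers is what shows up as the slow extra task. Stripped of the Spark types, the streaming shape of the method looks like this (a plain-Java sketch; the String records and literal header/footer stand in for the writer and IScoredScan types in the quoted code):

```java
import java.io.IOException;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.Iterator;

// Sketch of the header/records/footer streaming pattern from the quoted
// writeScores method. Only one record is held at a time, so memory use
// stays flat no matter how many records the iterator yields.
public class StreamingWriter {

    public static void writeAll(Appendable out, Iterator<String> records)
            throws IOException {
        out.append("header\n");                  // writer.appendHeader(...)
        while (records.hasNext()) {              // with toLocalIterator, this
            String record = records.next();      // may fetch the next partition
            out.append(record).append('\n');     // writer.appendScan(...)
        }
        out.append("footer\n");                  // writer.appendFooter(...)
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        writeAll(out, Arrays.asList("scan-1", "scan-2").iterator());
        System.out.print(out);                   // header, two scans, footer
    }
}
```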
