spark-user mailing list archives

From Andrew Ash <>
Subject Re: Stream RDD to local disk
Date Thu, 30 Jan 2014 10:41:38 GMT
I hadn't thought of calling the Hadoop process from within the Scala code, but
that is an improvement over my current process. Thanks for the suggestion.

It still requires saving to HDFS, dumping out to a file, and then cleaning
that temp directory out of HDFS, though, so it isn't quite my ideal process.
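Christopher's shell-out idea can be sketched roughly as follows (the helper
names and paths are illustrative, not an existing API; it assumes the
`hadoop` CLI is on the driver's PATH):

```scala
import scala.sys.process._

// Illustrative sketch: run an external command from the driver and
// return its exit code (0 on success). `!` blocks until the command
// finishes.
def run(cmd: Seq[String]): Int = cmd.!

// After the Spark job writes to HDFS location B, copy the results to
// local disk from within the same driver program.
def copyToLocal(hdfsDir: String, localPath: String): Int =
  run(Seq("hadoop", "fs", "-copyToLocal", hdfsDir, localPath))
```

Passing the command as a `Seq` avoids shell quoting issues that a single
command string can run into.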

Sent from my mobile phone
On Jan 30, 2014 2:37 AM, "Christopher Nguyen" <> wrote:

> Andrew, couldn't you do in the Scala code:
>   scala.sys.process.Process("hadoop fs -copyToLocal ...")!
> or is that still considered a second step?
> "hadoop fs" is almost certainly going to be better at copying these files
> than some memory-to-disk-to-memory serdes within Spark.
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao <>
> On Thu, Jan 30, 2014 at 2:21 AM, Andrew Ash <> wrote:
>> Hi Spark users,
>> I'm often using Spark for ETL type tasks, where the input is a large file
>> on-disk and the output is another large file on-disk.  I've loaded
>> everything into HDFS, but still need to produce files out on the other side.
>> Right now I produce these processed files in a 2-step process:
>> 1) in a single spark job, read from HDFS location A, process, and write
>> to HDFS location B
>> 2) run hadoop fs -cat hdfs:///path/to/* > /path/to/myfile to get it onto
>> the local disk.
>> It would be great to get this down to a 1-step process.
>> If I run .saveAsTextFile("...") on my RDD, then the shards of the file
>> are scattered onto the local disk across the cluster.  But if I .collect()
>> on the driver and then save to disk using normal Scala disk IO utilities,
>> I'll certainly OOM the driver.
>> *So the question*: is there a way to get an iterator for an RDD that I
>> can scan through the contents on the driver and flush to disk?
>> I found the RDD.iterator() method, but it looks to be intended for use by
>> RDD subclasses rather than end users (it requires Partition and TaskContext
>> parameters).  The .foreach() method also executes on each worker rather
>> than on the driver, so it would likewise scatter files across the cluster
>> if I saved from there.
>> Any suggestions?
>> Thanks!
>> Andrew
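The driver-side iterator Andrew asks for can be approximated by pulling one
partition at a time to the driver, so only a single partition is ever held in
driver memory instead of the whole collect()ed RDD. A hedged sketch (the
helper names are illustrative, and the exact runJob overload varies by Spark
version):

```scala
import java.io.{File, PrintWriter}

// Write successive batches of rows to one local file, holding only a
// single batch in memory at a time.
def streamToFile(batches: Iterator[Array[String]], file: File): Unit = {
  val out = new PrintWriter(file)
  try batches.foreach(_.foreach(out.println)) finally out.close()
}

// With Spark, each "batch" can be one partition fetched to the driver
// via SparkContext.runJob with an explicit partition list (assumes an
// active SparkContext `sc` and an RDD[String] `rdd`):
//
//   val batches = rdd.partitions.indices.iterator.map { p =>
//     sc.runJob(rdd, (it: Iterator[String]) => it.toArray, Seq(p)).head
//   }
//   streamToFile(batches, new File("/path/to/myfile"))
```

Because the `batches` iterator is lazy, each runJob call happens only when
the previous partition has already been flushed to disk.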
