spark-user mailing list archives

From Aaron Davidson <ilike...@gmail.com>
Subject Re: Iterator over RDD in PySpark
Date Fri, 01 Aug 2014 17:49:37 GMT
rdd.toLocalIterator will do almost what you want, but it requires that each
individual partition fit in memory (rather than each individual line).
Hopefully that's sufficient, though.
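The suggested pattern might look roughly like the sketch below. It is a minimal illustration, not the poster's actual code: a plain Python generator stands in for `rdd.toLocalIterator()` so the example is self-contained, and the output path is hypothetical. With a real RDD, the iterator pulls one partition at a time to the driver, so only a single partition needs to fit in memory while the .gz file is written line by line.

```python
import gzip
import os
import tempfile

def local_iterator():
    # Stand-in for rdd.toLocalIterator(): a lazy iterator that yields
    # one record at a time instead of materializing everything at once
    # the way rdd.collect() would.
    for i in range(5):
        yield f"record-{i}"

# Hypothetical local path; the single .gz file written here would then
# be uploaded to S3 as one object.
path = os.path.join(tempfile.mkdtemp(), "output.gz")

# Stream records into a single gzip file without holding them all in memory.
with gzip.open(path, "wt") as f:
    for line in local_iterator():
        f.write(line + "\n")

# Read the archive back to confirm the contents round-trip.
with gzip.open(path, "rt") as f:
    lines = f.read().splitlines()
print(lines)
```

Writing through `gzip.open` in text mode keeps the archive a single file, which is the constraint in the question; the trade-off is that the write is sequential on the driver rather than distributed.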


On Fri, Aug 1, 2014 at 1:38 AM, Andrei <faithlessfriend@gmail.com> wrote:

> Is there a way to get an iterator from an RDD? Something like rdd.collect(),
> but returning a lazy sequence rather than a single array.
>
> Context: I need to gzip processed data to upload it to Amazon S3. Since the
> archive should be a single file, I want to iterate over the RDD, writing each
> line to a local .gz file. The file is small enough to fit on local disk, but
> still too large to fit in memory.
>
