rdd.toLocalIterator will do almost what you want. It brings one partition at a time to the driver, so it requires that each individual partition fit in memory (rather than each individual line). Hopefully that's sufficient, though.
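
For illustration, here is a rough sketch of the gzip-writing side in Python. It assumes the lines are strings and that your Spark version exposes toLocalIterator in PySpark (it originated in the Scala/Java API); write_lines_gzip and the output path are made-up names for this example, not part of Spark.

```python
import gzip
from typing import Iterable


def write_lines_gzip(lines: Iterable[str], path: str) -> int:
    """Stream lines into a single gzip file; return the number of lines written.

    Hypothetical helper: only one line is held by this function at a time,
    so memory use is bounded by whatever the iterator itself buffers.
    """
    count = 0
    # "wt" opens the gzip stream in text mode so str lines can be written directly.
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for line in lines:
            out.write(line)
            out.write("\n")
            count += 1
    return count


# Assumed usage with Spark, given an existing RDD of strings named rdd:
#   write_lines_gzip(rdd.toLocalIterator(), "/tmp/output.gz")
```

Because the helper accepts any iterable, it can be tried out locally with a plain generator before pointing it at an RDD.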


On Fri, Aug 1, 2014 at 1:38 AM, Andrei <faithlessfriend@gmail.com> wrote:
Is there a way to get an iterator from an RDD? Something like rdd.collect(), but returning a lazy sequence rather than a single array.

Context: I need to gzip processed data to upload it to Amazon S3. Since the archive should be a single file, I want to iterate over the RDD, writing each line to a local .gz file. The file is small enough to fit on local disk, but still too large to fit into memory.