spark-user mailing list archives

From: Andrei <>
Subject: Re: Iterator over RDD in PySpark
Date: Fri, 01 Aug 2014 22:04:12 GMT
Thanks, Aaron, it should be fine with partitions (I can repartition it
anyway, right?).
But rdd.toLocalIterator is a purely Java/Scala method. Is there a Python
interface to it?
I can get a Java iterator through rdd._jrdd, but it isn't converted to a
Python iterator automatically. E.g.:

  >>> rdd = sc.parallelize([1, 2, 3, 4, 5])
  >>> it = rdd._jrdd.toLocalIterator()
  >>> next(it)
  14/08/02 01:02:32 INFO SparkContext: Starting job: apply at Iterator.scala:371
  14/08/02 01:02:32 INFO SparkContext: Job finished: apply at Iterator.scala:371, took 0.02064317 s

I understand that the returned byte array somehow corresponds to the actual
data, but how can I get it?
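
One workaround is to pull the data back one partition at a time from the
Python side, which has the same memory profile as toLocalIterator: only a
single partition has to fit in driver memory at once. A rough, untested
sketch (it assumes rdd.getNumPartitions() is available; on older PySpark
versions rdd._jrdd.splits().size() plays the same role):

  def partition_iterator(rdd):
      """Yield the RDD's elements one partition at a time."""
      for i in range(rdd.getNumPartitions()):
          # Each collect() runs a job whose tasks emit data only for
          # partition i; all other partitions return empty iterators.
          part = rdd.mapPartitionsWithIndex(
              lambda idx, it, i=i: it if idx == i else iter([])
          ).collect()
          for item in part:
              yield item

Note that every collect() launches a separate job, so this makes one pass
per partition; calling rdd.cache() first keeps the repeated evaluation of
the lineage cheap.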

On Fri, Aug 1, 2014 at 8:49 PM, Aaron Davidson <> wrote:

> rdd.toLocalIterator will do almost what you want, but requires that each
> individual partition fits in memory (rather than each individual line).
> Hopefully that's sufficient, though.
> On Fri, Aug 1, 2014 at 1:38 AM, Andrei <> wrote:
>> Is there a way to get an iterator from an RDD? Something like rdd.collect(),
>> but returning a lazy sequence instead of a single array.
>> Context: I need to GZip processed data to upload it to Amazon S3. Since the
>> archive should be a single file, I want to iterate over the RDD, writing each
>> line to a local .gz file. The file is small enough to fit on local disk, but
>> still large enough not to fit into memory.
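
For that original use case, the partition-at-a-time iterator sketched above
can feed a local gzip stream directly, so only one partition is in memory
while the single .gz file is written (again just a sketch: "output.gz" is a
placeholder path, and the records are assumed to be strings):

  import gzip

  # Write the whole RDD to a single local .gz file, one partition at a
  # time, ready to be uploaded to S3 afterwards.
  with gzip.open("output.gz", "wb") as out:
      for line in partition_iterator(rdd):
          out.write((line + "\n").encode("utf-8"))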
