spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrei <faithlessfri...@gmail.com>
Subject Re: Iterator over RDD in PySpark
Date Fri, 01 Aug 2014 22:04:12 GMT
Thanks, Aaron, it should be fine with partitions (I can repartition it
anyway, right?).
But rdd.toLocalIterator is purely Java/Scala method. Is there Python
interface to it?
I can get Java iterator though rdd._jrdd, but it isn't converted to Python
iterator automatically. E.g.:

  >>> rdd = sc.parallelize([1, 2, 3, 4, 5])
  >>> it = rdd._jrdd.toLocalIterator()
  >>> next(it)
  14/08/02 01:02:32 INFO SparkContext: Starting job: apply at
Iterator.scala:371
  ...
  14/08/02 01:02:32 INFO SparkContext: Job finished: apply at
Iterator.scala:371, took 0.02064317 s
  bytearray(b'\x80\x02K\x01.')

I understand that returned byte array somehow corresponds to actual data,
but how can I get it?



On Fri, Aug 1, 2014 at 8:49 PM, Aaron Davidson <ilikerps@gmail.com> wrote:

> rdd.toLocalIterator will do almost what you want, but requires that each
> individual partition fits in memory (rather than each individual line).
> Hopefully that's sufficient, though.
>
>
> On Fri, Aug 1, 2014 at 1:38 AM, Andrei <faithlessfriend@gmail.com> wrote:
>
>> Is there a way to get iterator from RDD? Something like rdd.collect(),
>> but returning lazy sequence and not single array.
>>
>> Context: I need to GZip processed data to upload it to Amazon S3. Since
>> archive should be a single file, I want to iterate over RDD, writing each
>> line to a local .gz file. File is small enough to fit local disk, but still
>> large enough not to fit into memory.
>>
>
>

Mime
View raw message