spark-user mailing list archives

From Andrei <>
Subject Iterator over RDD in PySpark
Date Fri, 01 Aug 2014 08:38:05 GMT
Is there a way to get an iterator from an RDD? Something like rdd.collect(), but
returning a lazy sequence rather than a single array.

Context: I need to GZip processed data to upload it to Amazon S3. Since the
archive should be a single file, I want to iterate over the RDD, writing each
line to a local .gz file. The file is small enough to fit on local disk, but
too large to fit in memory.
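A minimal sketch of the intended write loop. Recent PySpark versions expose RDD.toLocalIterator(), which pulls elements to the driver one partition at a time, so only a single partition has to fit in memory; here a plain generator stands in for that iterator so the sketch runs without a Spark cluster, and the path and record contents are illustrative assumptions.

```python
import gzip

# Stand-in for rdd.toLocalIterator(): any lazy iterable of strings works.
# With a real RDD you would write: it = rdd.toLocalIterator()
def lines():
    for i in range(1000):
        yield "record %d" % i

def write_gz(path, it):
    """Stream lines from an iterator into a single local .gz file."""
    with gzip.open(path, "wt") as f:  # "wt" = text mode (Python 3)
        for line in it:
            f.write(line + "\n")

write_gz("/tmp/out.gz", lines())
```

The resulting single .gz file can then be uploaded to S3 as-is; at no point is the full dataset materialized in driver memory.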
