spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrew Ash <>
Subject Streaming partitions to driver for use in .toLocalIterator
Date Wed, 18 Feb 2015 15:09:39 GMT
Hi Spark devs,

I'm creating a streaming export functionality for RDDs and am having some
trouble with large partitions.  The RDD.toLocalIterator() call pulls over a
partition at a time to the driver, and then streams the RDD out from that
partition before pulling in the next one.  When you have large partitions
though, you can OOM the driver, especially when multiple of these exports
are happening in the same SparkContext.

One idea I had was to repartition the RDD so partitions are smaller, but
it's hard to know a priori what the partition count should be, and I'd like
to avoid paying the shuffle cost if possible -- I think repartition to a
higher partition count forces a shuffle.

Is it feasible to rework this so the executor -> driver transfer in
.toLocalIterator is a steady stream rather than a partition at a time?


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message