spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From dave-anderson <>
Subject paging through an RDD that's too large to collect() all at once
Date Fri, 19 Sep 2014 02:58:06 GMT
I have an RDD on the cluster that I'd like to iterate over and perform some
operations on each element (push data from each element to another
downstream system outside of Spark).  I'd like to do this at the driver so I
can throttle the rate that I push to the downstream system (as opposed to
submitting a job to the Spark cluster and parallelizing the work - and
likely flooding the downstream system).

The RDD is too big to collect() all at once back into the memory space of
the driver.  Ideally I'd like to be able to "page" through the dataset,
iterating through a chunk of n RDD elements at a time back at the driver. 
It doesn't have to be _exactly_ n elements, just a reasonably small set of
elements at a time.

Is there a simple way to do this?

It looks like I could use RDD.filter() or RDD.collect[U](f:
PartialFunction[T, U]).  Either of those techniques requires defining a
function that will filter the RDD.  But the shape of the data in the RDD
could be such that, for a given function (say splitting by timestamp by hour
of day), it won't reliably split up into reasonably sized "pages".  Also, it
requires doing some analysis to determine boundaries on the data for
filtering.  Point is, it's extra logic

Any thoughts on if there's a simpler way to page through RDD elements back
at the driver?

View this message in context:
Sent from the Apache Spark User List mailing list archive at

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message