spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ali Tajeldin EDU <>
Subject Re: Incremental load of RDD from HDFS?
Date Tue, 20 Oct 2015 17:55:36 GMT
I could be misreading the code, but looking at the code for toLocalIterator (copied below),
it should lazily call runJob on each partition in your input.  It shouldn't be parsing the
entire RDD before returning from the first "next" call. If it is taking a long time on the
first "next" call, it means your first partition is very large or the runJob overhead is very
large.  If it is due to partition size, perhaps you can repartition your data into smaller
blocks (more partitions).

def toLocalIterator: Iterator[T] = withScope {
  def collectPartition(p: Int): Array[T] = {
    sc.runJob(this, (iter: Iterator[T]) => iter.toArray, Seq(p), allowLocal = false).head
  (0 until partitions.length).iterator.flatMap(i => collectPartition(i))


On Oct 20, 2015, at 9:23 AM, Chris Spagnoli <> wrote:

> I am new to Spark, and this user community, so my apologies if this was
> answered elsewhere and I missed it (I did try search first).
> We have multiple large RDDs stored across a HDFS via Spark (by calling
> pairRDD.saveAsNewAPIHadoopFile()), and one thing we need to do is re-load a
> given RDD (by calling sc.newAPIHadoopFile()) and return that data to our
> application.  In order to manage the data returned, we are calling
> rdd.toLocalIterator() followed by successive rddIt.hasNext() and
> calls to return the data to our application one row at a time. 
> This works pretty well.
> But we are observing that the first rddIt.hasNext() or
> invocation will block until the entire RDD is read from HDFS, and this can
> cause a considerable delay for a larger RDD.  Within our application we may
> end up only iterating over the first few hundred rows of data, so it is more
> important to us to be able to get those initial rows back as quickly as
> possible, rather than having to wait for the entire RDD to be loaded from
> HDFS.  Waiting for only the first partition to complete loading before
> starting to get data back would be fine, or even the first few partitions.
> The only solution I could come with would be to save each of our large RDDs
> as a collection of smaller sub-RDDs, and then load those sub-RDDs from HDFS
> sequentially into our application.  But that seems silly.
> Is there any approach using Spark which can start returning data from a
> large RDD before it is completely loaded from HDFS?
> Thanks in advance...
> - Chris
> --
> View this message in context:
> Sent from the Apache Spark User List mailing list archive at
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message