spark-issues mailing list archives

From "Corey J. Nolet (JIRA)"
Subject [jira] [Created] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
Date Thu, 08 Jan 2015 02:03:34 GMT
Corey J. Nolet created SPARK-5140:

             Summary: Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
                 Key: SPARK-5140
             Project: Spark
          Issue Type: New Feature
            Reporter: Corey J. Nolet
             Fix For: 1.3.0, 1.2.1

Not sure if this would change too much of the internals to be included in 1.2.1, but it
would be very helpful if it could be.

This ticket came out of a discussion between [~ilikerps] and me. Here's the result of some
testing that [~ilikerps] did:

I did some testing as well, and it turns out the "wait for other guy to finish caching" logic
is on a per-task basis, and it only works on tasks that happen to be executing on the same

Once a partition is cached, we will schedule tasks that touch that partition on that executor.
The problem here, though, is that the cache is in progress, and so the tasks are still scheduled
randomly (or with whatever locality the data source has), so tasks which end up on different
machines will not see that the cache is already in progress.

Here was my test, by the way:
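import scala.concurrent._
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global  // Future { ... } needs an implicit ExecutionContext

// Each of the 8 elements sleeps for 10 seconds the first time it is computed.
val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache()
// Submit four count jobs in parallel against the same cached RDD.
val futures = (0 until 4).map { _ => Future { rdd.count } }
Await.result(Future.sequence(futures), 120.seconds)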

Note that I run the future 4 times in parallel. I found that the first run has all tasks take
10 seconds. The second has about 50% of its tasks take 10 seconds, and the rest just wait
for the first stage to finish. The last two runs have no tasks that take 10 seconds; all wait
for the first two stages to finish.

What we want is the ability to fire off a job and have the DAG figure out that two RDDs depend
on the same parent so that when the children are scheduled concurrently, the first one to
start will activate the parent and both will wait on the parent. When the parent is done,
they will both be able to finish their work concurrently. We are trying to use this pattern
by having the parent cache results.
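
To make the requested semantics concrete, here is a minimal sketch in plain Scala (no Spark involved). It is only an analogy for the behaviour described above, not Spark's scheduler and not a proposed implementation. The names SharedParentSketch, parent, childSum and childCount are hypothetical, and the 500 ms sleep stands in for an expensive parent stage. The first child to touch the lazy val starts the parent; the other child receives the same Future and waits on it, so the parent runs once.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent._
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object SharedParentSketch {
  // Counts how many times the (hypothetical) parent computation actually runs.
  val parentRuns = new AtomicInteger(0)

  // lazy val initialization is thread-safe: the first child to touch it
  // starts the parent, and every later child gets the same Future back.
  lazy val parent: Future[Seq[Int]] = Future {
    parentRuns.incrementAndGet()
    Thread.sleep(500) // stand-in for an expensive parent stage
    0 until 8
  }

  // Two children that depend on the same parent and are started concurrently.
  def childSum(): Future[Int] = Future(parent).flatMap(identity).map(_.sum)
  def childCount(): Future[Int] = Future(parent).flatMap(identity).map(_.size)

  // Runs both children and blocks until both have finished.
  def run(): (Int, Int) =
    Await.result(childSum().zip(childCount()), 10.seconds)
}
```

Calling SharedParentSketch.run() starts both children at once; parentRuns can then be inspected to confirm that the parent was computed a single time rather than once per child.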
