spark-user mailing list archives

From Corey Nolet <>
Subject Re: Strange DAG scheduling behavior on currently dependent RDDs
Date Wed, 07 Jan 2015 15:42:22 GMT
I asked this question too soon. I am caching a bunch of RDDs in a TrieMap
so that our framework can wire them together, and the locking was not
completely correct: at times it was creating multiple new RDDs instead of
reusing the cached versions, which produced completely separate lineages.
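
For anyone who hits the same thing, here is a minimal sketch of the
check-then-act race and one way to close it. The registry class and the
string-key scheme below are made up for illustration, not our actual
framework code:

    import scala.collection.concurrent.TrieMap
    import org.apache.spark.rdd.RDD

    class RddRegistry {
      private val cache = TrieMap.empty[String, RDD[_]]

      // A plain get-then-put here is a check-then-act race: two threads can
      // both miss on the same key, both evaluate `build`, and hand out two
      // different RDDs with completely separate lineages. putIfAbsent makes
      // the insert atomic, so exactly one RDD survives per key.
      def getOrCreate[T](name: String)(build: => RDD[T]): RDD[T] =
        cache.get(name) match {
          case Some(rdd) => rdd.asInstanceOf[RDD[T]]
          case None =>
            val fresh = build
            cache.putIfAbsent(name, fresh) match {
              case Some(existing) => existing.asInstanceOf[RDD[T]] // lost the race
              case None           => fresh                         // we won
            }
        }
    }

Under contention `build` can still run twice, but the loser's RDD is
discarded before any job sees it, so only one lineage survives. Note that
TrieMap's inherited getOrElseUpdate is not atomic in current Scala
versions, so it does not close this race either.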

What's strange is that this bug only surfaced when I updated Spark.

On Wed, Jan 7, 2015 at 9:12 AM, Corey Nolet <> wrote:

> We just updated to Spark 1.2.0 from Spark 1.1.0. We have a small framework
> we've been developing that connects various RDDs together based on some
> predefined business cases. After updating to 1.2.0, the scheduling
> behavior has diverged quite significantly from our expectations about how
> the stages within jobs are executed concurrently.
> Given 3 RDDs:
> RDD1 = inputFromHDFS().groupBy().sortBy().etc().cache()
> RDD2 = RDD1.outputToFile
> RDD3 = RDD1.groupBy().outputToFile
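> For concreteness, a minimal Scala rendering of that wiring might look like
> this (the input path, key functions, and output paths are made up):
>
>     val rdd1 = sc.textFile("hdfs:///input")    // inputFromHDFS()
>       .groupBy(line => line.take(1))           // groupBy()
>       .sortBy(_._1)                            // sortBy()
>       .cache()
>     rdd1.saveAsTextFile("hdfs:///out/rdd2")    // RDD2 = RDD1.outputToFile
>     rdd1.groupBy(_._2.size)                    // RDD3 = RDD1.groupBy()
>       .saveAsTextFile("hdfs:///out/rdd3")      //        .outputToFile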
> In Spark 1.1.0, we expected RDD1 to be scheduled based on the first stage
> encountered (RDD2's outputToFile or RDD3's groupBy()), and then for RDD2
> and RDD3 to both block waiting for RDD1 to complete and cache, at which
> point both would use the cached version to complete their work.
> Spark 1.2.0 instead seems to schedule two (albeit concurrently running)
> copies of each of RDD1's stages (inputFromHDFS, groupBy(), sortBy(), and
> etc() each get run twice). It does not look like the results are shared
> between these jobs at all.
> Are we doing something wrong? Is there a setting that I'm not
> understanding somewhere?
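
One more note for the archive, now that the real cause turned out to be on
our side: cache() is lazy, and Spark does not make a second job block while
the first one fills the cache. If two jobs that share an uncached RDD are
submitted concurrently, both can end up scheduling and computing the shared
stages. A blunt but reliable way to guarantee reuse is to force
materialization with an action before kicking off the dependent jobs (paths
made up, as above):

    rdd1.cache()
    rdd1.count()                               // action: populates the cache
    rdd1.saveAsTextFile("hdfs:///out/rdd2")    // reuses cached partitions
    rdd1.groupBy(_._2.size)
      .saveAsTextFile("hdfs:///out/rdd3")      // so does this one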
