spark-user mailing list archives

From Sean Owen <>
Subject Re: Does Spark automatically run different stages concurrently when possible?
Date Mon, 19 Jan 2015 11:47:06 GMT
From the OP:

(1) val lines = Import full dataset using sc.textFile
(2) val ABonly = Filter out all rows from "lines" that are not of type A or B
(3) val processA = Process only the A rows from ABonly
(4) val processB = Process only the B rows from ABonly

I assume that 3 and 4 are actions, or else nothing happens here at all.

When 3 is invoked, it will compute 1, then 2, then 3. 4 will happen
after 3, and may even cause 1 and 2 to be recomputed if nothing is
cached or persisted.
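
The OP's four steps might look like the following sketch. This is my
reconstruction, not the OP's actual code: the input path, the
startsWith-based type check, and the toUpperCase "processing" are all
placeholders for whatever the real job does. The cache() call is the
point: without it, each action replays steps 1 and 2 from scratch.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ab-pipeline"))

// (1) Import full dataset (path is a placeholder)
val lines  = sc.textFile("hdfs:///data/input")

// (2) Keep only A and B rows (type check is a placeholder)
val ABonly = lines.filter(l => l.startsWith("A") || l.startsWith("B"))
ABonly.cache() // without this, (1) and (2) rerun for every action below

// (3) and (4): filter each type, apply some per-row processing,
// then an action (count) to actually trigger the computation
val countA = ABonly.filter(_.startsWith("A")).map(_.toUpperCase).count()
val countB = ABonly.filter(_.startsWith("B")).map(_.toUpperCase).count()
```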

You can invoke 3 and 4 in parallel on the driver if you like, for
example from separate threads. That's fine, but keep in mind that each
action blocks the driver thread that invokes it.
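
Concretely, since each action blocks its calling thread, one way to
overlap them is to submit each from its own thread via Futures. A
minimal sketch, assuming ABonly is the filtered RDD from the OP's step
2 (cached, so the two jobs don't both recompute it) and using count()
as a stand-in action:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each Future runs on its own driver thread; Spark's scheduler can
// then run the two jobs concurrently if resources allow.
val futureA = Future { ABonly.filter(_.startsWith("A")).count() } // (3)
val futureB = Future { ABonly.filter(_.startsWith("B")).count() } // (4)

val countA = Await.result(futureA, Duration.Inf)
val countB = Await.result(futureB, Duration.Inf)
```

Whether the jobs actually overlap also depends on the scheduling mode
and available executor cores; by default Spark queues jobs FIFO.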

On Mon, Jan 19, 2015 at 8:21 AM, davidkl <> wrote:
> Hi Jon, I am looking for an answer to a similar question in the docs
> now; so far no clue.
> I would need to know what Spark's behaviour is in a situation like the
> example you provided, but also taking into account that there are
> multiple partitions/workers.
> I could imagine that different Spark workers are not synchronized in
> terms of waiting for each other to progress to the next step/stage for
> the partitions of data they are assigned, while I believe in streaming
> they would wait for the current batch to complete before starting on a
> new one.
> In the code I am working on, I need to make sure a particular step is
> completed (in all workers, for all partitions) before the next
> transformation is applied.
> It would be great if someone could clarify or point to these issues in
> the docs! :-)
> --
> Sent from the Apache Spark User List mailing list archive.

