spark-user mailing list archives

From Ashish <paliwalash...@gmail.com>
Subject Re: Does Spark automatically run different stages concurrently when possible?
Date Tue, 20 Jan 2015 03:38:30 GMT
Sean,

A related question: should the RDD be persisted after step 2 or after
step 3? (I assume nothing would actually happen before step 3.)

On Mon, Jan 19, 2015 at 5:17 PM, Sean Owen <sowen@cloudera.com> wrote:
> From the OP:
>
> (1) val lines = Import full dataset using sc.textFile
> (2) val ABonly = Filter out all rows from "lines" that are not of type A or B
> (3) val processA = Process only the A rows from ABonly
> (4) val processB = Process only the B rows from ABonly
>
> I assume that 3 and 4 are actions, or else nothing happens here at all.
>
> When 3 is invoked, it will compute 1, then 2, then 3. 4 will happen
> after 3, and may even cause 1 and 2 to happen again if nothing is
> persisted.
>
> You can invoke 3 and 4 in parallel on the driver if you like (for
> example, from separate threads). That's fine. But each action blocks
> the driver thread that calls it.
>
>
>
> On Mon, Jan 19, 2015 at 8:21 AM, davidkl <davidklmlg@hotmail.com> wrote:
>> Hi Jon, I am currently looking in the docs for an answer to a similar
>> question, so far without success.
>>
>> I would need to know how Spark behaves in a situation like the example
>> you provided, but also taking into account that there are multiple
>> partitions/workers.
>>
>> I could imagine it's possible that different spark workers are not
>> synchronized in terms of waiting for each other to progress to the next
>> step/stage for the partitions of data they get assigned, while I believe in
>> streaming they would wait for the current batch to complete before they
>> start working on a new one.
>>
>> In the code I am working on, I need to make sure a particular step is
>> completed (in all workers, for all partitions) before the next
>> transformation is applied.
>>
>> Would be great if someone could clarify or point to these issues in the doc!
>> :-)
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-automatically-run-different-stages-concurrently-when-possible-tp21075p21227.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>
>
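
For the four steps quoted above, one possible arrangement is to mark the
output of step 2 (ABonly) for persistence before the first action runs,
since that is the RDD both 3 and 4 read from. The following is only a
minimal sketch under stated assumptions: the input path, the local master,
and the rule that a row's type is its first character are all hypothetical
and not taken from the original code, and count() merely stands in for
whatever processing 3 and 4 really do.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object PersistSharedParent {
  def main(args: Array[String]): Unit = {
    // Assumption: local master and app name are placeholders for illustration.
    val conf = new SparkConf().setAppName("persist-shared-parent").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // (1) Lazy: nothing is read yet. The path is hypothetical.
      val lines = sc.textFile("data/input.txt")

      // (2) Lazy: keep only rows of type A or B.
      // Assumption: the row type is encoded as the first character.
      val abOnly = lines.filter(l => l.startsWith("A") || l.startsWith("B"))

      // Only marks the RDD for caching; partitions are stored when an
      // action first computes them.
      abOnly.persist(StorageLevel.MEMORY_ONLY)

      // (3) and (4) as actions. Each count() blocks its calling thread, so
      // they are submitted from separate threads to run as concurrent jobs.
      val countA = Future { abOnly.filter(_.startsWith("A")).count() }
      val countB = Future { abOnly.filter(_.startsWith("B")).count() }

      val a = Await.result(countA, Duration.Inf)
      val b = Await.result(countB, Duration.Inf)
      println(s"A rows: $a, B rows: $b")

      // Release the cached partitions once both actions are done.
      abOnly.unpersist()
    } finally {
      sc.stop()
    }
  }
}
```

If the two actions are run one after the other instead, the second one can
read the cached partitions of ABonly rather than recomputing 1 and 2. When
they run concurrently, as sketched here, some partitions may still be
computed more than once before they land in the cache.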



-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal


