spark-user mailing list archives

From Mark Hamstra <m...@clearstorydata.com>
Subject Re: Does Spark automatically run different stages concurrently when possible?
Date Wed, 21 Jan 2015 00:38:26 GMT
A map followed by a filter will not be two stages, but rather one stage that pipelines the
map and filter.
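
[Editor's note] The pipelining Mark describes can be illustrated outside Spark with plain Python generators, which also evaluate map and filter lazily, one element at a time. This is an analogy, not Spark itself; `pipeline_trace` and the trace strings are hypothetical names for the sketch:

```python
# Sketch (an analogy, not Spark itself): generators pipeline map and
# filter into a single pass, just as Spark fuses them into one stage.
def pipeline_trace():
    trace = []

    def mapper(xs):
        for x in xs:
            trace.append(f"map({x})")
            yield x * 10

    def keeper(xs):
        for x in xs:
            trace.append(f"filter({x})")
            if x > 10:
                yield x

    # Each element flows through map and then filter before the next
    # element is read; there are not two separate loops over the data.
    result = list(keeper(mapper([1, 2, 3])))
    return trace, result

trace, result = pipeline_trace()
print(trace)   # ['map(1)', 'filter(10)', 'map(2)', 'filter(20)', 'map(3)', 'filter(30)']
print(result)  # [20, 30]
```

The interleaved trace is the point: map and filter run as one fused pass, which is what "one stage that pipelines the map and filter" means.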


> On Jan 20, 2015, at 10:26 AM, Kane Kim <kane.isturm@gmail.com> wrote:
> 
> Related question - is execution across different stages optimized? I.e.,
> will a map followed by a filter require two loops, or will they be
> combined into a single one?
> 
>> On Tue, Jan 20, 2015 at 4:33 AM, Bob Tiernay <btiernay@hotmail.com> wrote:
>> I found the following to be a good discussion of the same topic:
>> 
>> http://apache-spark-user-list.1001560.n3.nabble.com/The-concurrent-model-of-spark-job-stage-task-td13083.html
>> 
>> 
>>> From: sowen@cloudera.com
>>> Date: Tue, 20 Jan 2015 10:02:20 +0000
>>> Subject: Re: Does Spark automatically run different stages concurrently
>>> when possible?
>>> To: paliwalashish@gmail.com
>>> CC: davidklmlg@hotmail.com; user@spark.apache.org
>> 
>>> 
>>> You can persist the RDD in (2) right after it is created. It will not
>>> cause it to be persisted immediately, but rather the first time it is
>>> materialized. If you persist after (3) is calculated, then it will be
>>> re-calculated (and persisted) after (4) is calculated.
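
[Editor's note] In Spark terms, Sean's advice amounts to calling `ABonly.persist()` right after step (2). The deferred materialization he describes — persist does nothing immediately; the data is cached the first time an action computes it — can be imitated in plain Python. `CachedDataset` is a hypothetical stand-in for a cached RDD, not a Spark class:

```python
# Sketch: like persist(), marking a dataset as cached does nothing by
# itself; the value is computed by the first action and reused afterwards.
class CachedDataset:
    def __init__(self, compute):
        self._compute = compute   # deferred computation, like an RDD lineage
        self._cache = None
        self.computations = 0     # how many times the lineage actually ran

    def materialize(self):
        if self._cache is None:
            self.computations += 1
            self._cache = self._compute()
        return self._cache

# (2) mark ABonly for caching -- nothing is computed yet
ab_only = CachedDataset(lambda: [x for x in [1, 2, 3, 4] if x % 2 == 0])
assert ab_only.computations == 0

# (3) the first action materializes and caches the dataset
process_a = sum(ab_only.materialize())   # 6
# (4) the second action reuses the cache; the lineage ran only once
process_b = max(ab_only.materialize())   # 4
assert ab_only.computations == 1
```

Without the cache, step (4) would re-run the whole lineage, which is the re-computation Sean warns about when persist is called too late.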
>>> 
>>>> On Tue, Jan 20, 2015 at 3:38 AM, Ashish <paliwalashish@gmail.com> wrote:
>>>> Sean,
>>>> 
>>>> A related question. When to persist the RDD after step 2 or after Step
>>>> 3 (nothing would happen before step 3 I assume)?
>>>> 
>>>>> On Mon, Jan 19, 2015 at 5:17 PM, Sean Owen <sowen@cloudera.com> wrote:
>>>>> From the OP:
>>>>> 
>>>>> (1) val lines = Import full dataset using sc.textFile
>>>>> (2) val ABonly = Filter out all rows from "lines" that are not of type
>>>>> A or B
>>>>> (3) val processA = Process only the A rows from ABonly
>>>>> (4) val processB = Process only the B rows from ABonly
>>>>> 
>>>>> I assume that 3 and 4 are actions, or else nothing happens here at all.
>>>>> 
>>>>> When 3 is invoked, it will compute 1, then 2, then 3. 4 will happen
>>>>> after 3, and may even cause 1 and 2 to happen again if nothing is
>>>>> persisted.
>>>>> 
>>>>> You can invoke 3 and 4 in parallel on the driver if you like. That's
>>>>> fine. But actions are blocking in the driver.
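
[Editor's note] Since each action blocks the driver thread that calls it, invoking (3) and (4) concurrently means submitting them from separate driver threads. A minimal sketch with stand-in actions — `action_a` and `action_b` are hypothetical placeholders for blocking Spark actions such as `count()`:

```python
# Sketch: two blocking "actions" submitted from separate driver threads
# so their jobs can run concurrently instead of back to back.
from concurrent.futures import ThreadPoolExecutor
import time

def action_a():            # stand-in for e.g. processA.count()
    time.sleep(0.1)
    return 3

def action_b():            # stand-in for e.g. processB.count()
    time.sleep(0.1)
    return 4

with ThreadPoolExecutor(max_workers=2) as pool:
    fa = pool.submit(action_a)   # submit returns immediately;
    fb = pool.submit(action_b)   # each action blocks only its own thread
    a, b = fa.result(), fb.result()

print(a, b)  # 3 4
```

Submitting both before collecting either result is what allows the two jobs to overlap; calling the actions sequentially would serialize them.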
>>>>> 
>>>>> 
>>>>> 
>>>>> On Mon, Jan 19, 2015 at 8:21 AM, davidkl <davidklmlg@hotmail.com>
>>>>> wrote:
>>>>>> Hi Jon, I am looking for an answer for a similar question in the
>>>>>> doc now, so far no clue.
>>>>>> 
>>>>>> I would need to know what Spark's behaviour is in a situation like
>>>>>> the example you provided, but also taking into account that there
>>>>>> are multiple partitions/workers.
>>>>>> 
>>>>>> I could imagine that different Spark workers are not synchronized
>>>>>> in terms of waiting for each other before progressing to the next
>>>>>> step/stage for the partitions of data they are assigned, while I
>>>>>> believe in streaming they would wait for the current batch to
>>>>>> complete before starting work on a new one.
>>>>>> 
>>>>>> In the code I am working on, I need to make sure a particular step
>>>>>> is completed (in all workers, for all partitions) before the next
>>>>>> transformation is applied.
>>>>>> 
>>>>>> Would be great if someone could clarify or point to these issues
>>>>>> in the doc! :-)
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-automatically-run-different-stages-concurrently-when-possible-tp21075p21227.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
>>>>>> 
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> thanks
>>>> ashish
>>>> 
>>>> Blog: http://www.ashishpaliwal.com/blog
>>>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>>> 
> 
> 


