spark-user mailing list archives

From sparrow <do...@celtra.com>
Subject Re: Spark worker threads waiting
Date Fri, 21 Mar 2014 12:33:26 GMT
Here is the stage overview:
[image: Inline image 2]

and here are the stage details for stage 0:
[image: Inline image 1]
Transformations from the first stage to the second are trivial, so they
should not be the bottleneck (apart from keyBy().groupByKey(), which causes
the shuffle write/read).
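To illustrate why that keyBy().groupByKey() step dominates the shuffle: groupByKey moves every record across the network, whereas a map-side combine (reduceByKey-style, applicable only when the downstream logic is an aggregation) moves at most one partial value per key per map partition. A minimal Spark-free sketch with made-up data:

```python
from collections import defaultdict

# Sketch (plain Python, no Spark): compare shuffled record counts for
# groupByKey vs. a reduceByKey-style map-side combine. Data is illustrative.
partitions = [
    [("a", 1), ("a", 1), ("b", 1)],   # map partition 0
    [("a", 1), ("b", 1), ("b", 1)],   # map partition 1
]

# groupByKey: every (key, value) record crosses the network.
group_by_key_records = sum(len(p) for p in partitions)

# reduceByKey with a sum: each partition combines locally first, so only
# one (key, partial_sum) pair per distinct key leaves each partition.
reduce_by_key_records = 0
for p in partitions:
    local = defaultdict(int)
    for k, v in p:
        local[k] += v
    reduce_by_key_records += len(local)

print(group_by_key_records)   # 6 records shuffled
print(reduce_by_key_records)  # 4 records shuffled
```

With skewed real data the gap is far larger; whether reduceByKey applies depends on what the grouped values are used for afterwards.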

Kind regards, Domen



On Thu, Mar 20, 2014 at 8:38 PM, Mayur Rustagi [via Apache Spark User List]
<ml-node+s1001560n2962h61@n3.nabble.com> wrote:

> I would have preferred the stage window details & aggregate task
> details (above the task list).
> Basically, if you run a job, it translates to multiple stages, and each
> stage translates to multiple tasks (each run on a worker core).
> So give a breakup like:
> my job is taking 16 min,
> 3 stages: stage 1: 5 min, stage 2: 10 min, stage 3: 1 min.
> For stage 2, give me the aggregate task screenshot which shows the 50th,
> 75th & 100th percentiles.
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Thu, Mar 20, 2014 at 9:55 AM, sparrow <[hidden email]> wrote:
>
>>
>> This is what the web UI looks like:
>> [image: Inline image 1]
>>
>> I also tail all the worker logs, and these are the last entries before
>> the waiting begins:
>>
>> 14/03/20 13:29:10 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> maxBytesInFlight: 50331648, minRequest: 10066329
>> 14/03/20 13:29:10 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> Getting 29853 non-zero-bytes blocks out of 37714 blocks
>> 14/03/20 13:29:10 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
>> Started 5 remote gets in  62 ms
>> [PSYoungGen: 12464967K->3767331K(10552192K)]
>> 36074093K->29053085K(44805696K), 0.6765460 secs] [Times: user=5.35
>> sys=0.02, real=0.67 secs]
>> [PSYoungGen: 10779466K->3203826K(9806400K)]
>> 35384386K->31562169K(44059904K), 0.6925730 secs] [Times: user=5.47
>> sys=0.00, real=0.70 secs]
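A side note on the GC lines quoted above: the stop-the-world time is the `real=` figure, and here it is well under a second per collection, so these pauses alone cannot account for multi-minute gaps. A small sketch that totals them (the regex assumes exactly the format shown; other JVM logging flags print differently):

```python
import re

# Sum the stop-the-world pause times ("real=... secs") from the ParallelGC
# lines quoted in the message above.
gc_log = """\
[PSYoungGen: 12464967K->3767331K(10552192K)] 36074093K->29053085K(44805696K), 0.6765460 secs] [Times: user=5.35 sys=0.02, real=0.67 secs]
[PSYoungGen: 10779466K->3203826K(9806400K)] 35384386K->31562169K(44059904K), 0.6925730 secs] [Times: user=5.47 sys=0.00, real=0.70 secs]
"""

pauses = [float(m) for m in re.findall(r"real=([0-9.]+) secs", gc_log)]
total_pause = sum(pauses)
print(round(total_pause, 2))  # about 1.37 seconds across both collections
```

Roughly 1.4 s of GC across two collections suggests the 3-4 minute waits come from somewhere else, e.g. the shuffle fetch itself.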
>>
>> From the screenshot above you can see that tasks take ~6 minutes to
>> complete. The time tasks take seems to depend on the amount of input
>> data: if the s3 input string captures 2.5 times less data (less data to
>> shuffle write and later read), the same tasks take 1 minute. Any idea
>> how to debug what the workers are doing?
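One more observation (an inference from the log values, not something the thread states): the fetch numbers above are consistent with the Spark 0.8.x default of spark.reducer.maxMbInFlight = 48 MB, with each fetch request sized at roughly a fifth of that so that up to 5 fetches can run in parallel, which matches the "Started 5 remote gets" line:

```python
# Values copied from the BasicBlockFetcherIterator log lines above.
max_bytes_in_flight = 50331648   # logged maxBytesInFlight
min_request = 10066329           # logged minRequest

print(max_bytes_in_flight == 48 * 1024 * 1024)  # True: the 48 MB default
print(max_bytes_in_flight // 5)                 # 10066329, matches minRequest
```

If fetches are the bottleneck, raising spark.reducer.maxMbInFlight (at the cost of reducer memory) is one knob to experiment with.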
>>
>> Domen
>>
>> On Wed, Mar 19, 2014 at 5:27 PM, Mayur Rustagi [via Apache Spark User
>> List] <[hidden email]> wrote:
>>
>>> You could have some outlier task that is preventing the next set of
>>> stages from launching. Can you check the stages' state in the Spark
>>> WebUI: is any task running, or is everything halted?
>>> Regards
>>> Mayur
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>
>>>
>>>
>>> On Wed, Mar 19, 2014 at 5:40 AM, Domen Grabec <[hidden email]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a cluster with 16 nodes; each node has 69 GB of RAM (50 GB goes
>>>> to Spark) and 8 cores, running Spark 0.8.1. I have a groupByKey
>>>> operation that causes a wide RDD dependency, so shuffle write and
>>>> shuffle read are performed.
>>>>
>>>> For some reason all worker threads seem to sleep for about 3-4 minutes
>>>> each time they perform a shuffle read and complete a set of tasks. See
>>>> the graphs below: no resources are being utilized in specific time
>>>> windows.
>>>>
>>>> Every 3-4 minutes, the next set of tasks is grabbed and processed, and
>>>> then another waiting period happens.
>>>>
>>>> Each task has 80 MB ± 5 MB of data to shuffle read.
>>>>
>>>>  [image: Inline image 1]
>>>>
>>>> Here <http://pastebin.com/UHWMdTRY> is a link to a thread dump taken
>>>> in the middle of the waiting period. Any idea what could cause the
>>>> long waits?
>>>>
>>>> Kind regards, Domen
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>


stageDetails.png (30K) <http://apache-spark-user-list.1001560.n3.nabble.com/attachment/2988/0/stageDetails.png>
stages.png (80K) <http://apache-spark-user-list.1001560.n3.nabble.com/attachment/2988/1/stages.png>




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-worker-threads-waiting-tp2859p2988.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.