spark-dev mailing list archives

From Victor Tso-Guillen <v...@paxata.com>
Subject Re: Scheduler hang?
Date Thu, 26 Feb 2015 21:37:11 GMT
Okay I confirmed my suspicions of a hang. I made a request that stopped
progressing, though the already-scheduled tasks had finished. I made a
separate request that was small enough not to hang, and it kicked the hung
job enough to finish. I think what's happening is that the scheduler or the
local backend is not kicking the revive offers messaging at the right time,
but I have to dig into the code some more to nail the culprit. Anyone on
this list have experience in those code areas that could help?

On Thu, Feb 26, 2015 at 2:27 AM, Victor Tso-Guillen <vtso@paxata.com> wrote:

> Thanks for the link. Unfortunately, I turned on rdd compression and
> nothing changed. I tried moving netty -> nio and no change :(
>
> On Thu, Feb 26, 2015 at 2:01 AM, Akhil Das <akhil@sigmoidanalytics.com>
> wrote:
>
>> Not many that I know of, but I bumped into this one:
>> https://issues.apache.org/jira/browse/SPARK-4516
>>
>> Thanks
>> Best Regards
>>
>> On Thu, Feb 26, 2015 at 3:26 PM, Victor Tso-Guillen <vtso@paxata.com>
>> wrote:
>>
>>> Is there any potential problem from 1.1.1 to 1.2.1 with shuffle
>>> dependencies that produce no data?
>>>
>>> On Thu, Feb 26, 2015 at 1:56 AM, Victor Tso-Guillen <vtso@paxata.com>
>>> wrote:
>>>
>>>> The data is small. The job is composed of many small stages.
>>>>
>>>> * I found that the problem shows up with fewer than 222. What will be
>>>> gained by going higher?
>>>> * Pushing up the parallelism only pushes up the boundary at which the
>>>> system appears to hang. I'm worried about some sort of message loss or
>>>> inconsistency.
>>>> * Yes, we are using Kryo.
>>>> * I'll try that, but I'm again a little confused why you're
>>>> recommending this. I'm stumped so might as well?
>>>>
>>>> On Wed, Feb 25, 2015 at 11:13 PM, Akhil Das <akhil@sigmoidanalytics.com>
>>>> wrote:
>>>>
>>>>> What operation are you trying to do and how big is the data that you
>>>>> are operating on?
>>>>>
>>>>> Here are a few things you can try:
>>>>>
>>>>> - Repartition the RDD to a higher number than 222
>>>>> - Specify the master as local[*] or local[10]
>>>>> - Use Kryo Serializer (.set("spark.serializer",
>>>>> "org.apache.spark.serializer.KryoSerializer"))
>>>>> - Enable RDD Compression (.set("spark.rdd.compress","true") )
>>>>>
>>>>>
>>>>> Thanks
>>>>> Best Regards
>>>>>
>>>>> On Thu, Feb 26, 2015 at 10:15 AM, Victor Tso-Guillen <vtso@paxata.com>
>>>>> wrote:
>>>>>
>>>>>> I'm getting this really reliably on Spark 1.2.1. Basically I'm in
>>>>>> local mode with parallelism at 8. I have 222 tasks and I never seem to get
>>>>>> far past 40. Usually in the 20s to 30s it will just hang. The last logging
>>>>>> is below, and a screenshot of the UI.
>>>>>>
>>>>>> 2015-02-25 20:39:55.779 GMT-0800 INFO  [task-result-getter-3] TaskSetManager - Finished task 3.0 in stage 16.0 (TID 22) in 612 ms on localhost (1/5)
>>>>>> 2015-02-25 20:39:55.825 GMT-0800 INFO  [Executor task launch worker-10] Executor - Finished task 1.0 in stage 16.0 (TID 20). 2492 bytes result sent to driver
>>>>>> 2015-02-25 20:39:55.825 GMT-0800 INFO  [Executor task launch worker-8] Executor - Finished task 2.0 in stage 16.0 (TID 21). 2492 bytes result sent to driver
>>>>>> 2015-02-25 20:39:55.831 GMT-0800 INFO  [task-result-getter-0] TaskSetManager - Finished task 1.0 in stage 16.0 (TID 20) in 670 ms on localhost (2/5)
>>>>>> 2015-02-25 20:39:55.836 GMT-0800 INFO  [task-result-getter-1] TaskSetManager - Finished task 2.0 in stage 16.0 (TID 21) in 674 ms on localhost (3/5)
>>>>>> 2015-02-25 20:39:55.891 GMT-0800 INFO  [Executor task launch worker-9] Executor - Finished task 0.0 in stage 16.0 (TID 19). 2492 bytes result sent to driver
>>>>>> 2015-02-25 20:39:55.896 GMT-0800 INFO  [task-result-getter-2] TaskSetManager - Finished task 0.0 in stage 16.0 (TID 19) in 740 ms on localhost (4/5)
>>>>>>
>>>>>> [image: Inline image 1]
>>>>>> What should I make of this? Where do I start?
>>>>>>
>>>>>> Thanks,
>>>>>> Victor
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
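
For reference, the four settings suggested earlier in the thread (a higher partition count, an explicit local master, Kryo, and RDD compression) could be combined roughly as in the sketch below. This is only an illustration under assumptions: the object name, the app name, the dummy data, and the partition count of 444 are hypothetical and do not come from the thread, and, as reported above, these settings did not make the hang go away.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver program; none of these names come from the thread.
object SchedulerHangSettings {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[10]")              // or "local[*]" to use all cores
      .setAppName("scheduler-hang-repro")  // hypothetical app name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.rdd.compress", "true")

    val sc = new SparkContext(conf)
    try {
      // Dummy data standing in for the real job's input (assumption).
      val rdd = sc.parallelize(1 to 1000)
      // Repartition to more than 222 partitions; 444 is an arbitrary choice.
      val repartitioned = rdd.repartition(444)
      println(s"partitions = ${repartitioned.partitions.length}")
    } finally {
      sc.stop()
    }
  }
}
```

The master and both `set` keys are standard Spark configuration; only the surrounding scaffolding is made up for the example.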
