spark-user mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject Re: Worker hangs with 100% CPU in Standalone cluster
Date Thu, 16 Jan 2014 20:42:55 GMT
It sounds like the takeaway is that if you're using custom classes, you
need to make sure that their hashCode() and equals() methods are
value-based?
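
For illustration, here is a minimal sketch of what "value-based" could look
like on the JVM. It is written in Java, and CampaignKey and its fields are
hypothetical names, not taken from Grega's application. It only shows the
general contract: equals() and hashCode() are derived from the same
immutable fields, so two instances holding the same values behave as one
key in a hash-based collection.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical key class for illustration only; not from the original thread.
final class CampaignKey {
    // Immutable fields keep the hash code stable while the key is stored.
    private final String campaignId;
    private final int day;

    CampaignKey(String campaignId, int day) {
        this.campaignId = campaignId;
        this.day = day;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof CampaignKey)) return false;
        CampaignKey that = (CampaignKey) other;
        // Compare by field values, not by object identity.
        return day == that.day && Objects.equals(campaignId, that.campaignId);
    }

    @Override
    public int hashCode() {
        // Derived from exactly the fields that equals() compares.
        return Objects.hash(campaignId, day);
    }
}

final class ValueBasedKeyDemo {
    // Inserts two separately constructed but equal keys and returns the map size.
    static int countDistinct() {
        Map<CampaignKey, Integer> counts = new HashMap<>();
        counts.merge(new CampaignKey("c1", 16), 1, Integer::sum);
        counts.merge(new CampaignKey("c1", 16), 1, Integer::sum);
        return counts.size();
    }

    public static void main(String[] args) {
        // Both inserts land on the same entry because the keys are equal by value.
        System.out.println(ValueBasedKeyDemo.countDistinct());
    }
}
```

In Scala, a case class gets value-based equals() and hashCode() generated
by the compiler, which is usually the simplest way to get this behaviour
for keys.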


On Thu, Jan 16, 2014 at 12:08 PM, Patrick Wendell <pwendell@gmail.com> wrote:

> Thanks for following up and explaining this one! Definitely something
> other users might run into...
>
>
> On Thu, Jan 16, 2014 at 5:58 AM, Grega Kešpret <grega@celtra.com> wrote:
>
>> Just to follow up: we have since traced the problem to our application
>> code, not Spark. In some cases there was an infinite loop in the Scala
>> HashTable linear probing algorithm, where an element's next() pointed
>> at itself. It was probably caused by incorrect hashCode() and equals()
>> methods on the objects we were storing.
>>
>> Milos, we also run the Master node separately from the Worker nodes. Could
>> someone from the Spark team comment on that?
>>
>> Grega
>> --
>> *Grega Kešpret*
>> Analytics engineer
>>
>> Celtra — Rich Media Mobile Advertising
>> celtra.com <http://www.celtra.com/> | @celtramobile<http://www.twitter.com/celtramobile>
>>
>>
>> On Thu, Jan 16, 2014 at 2:46 PM, Milos Nikolic <milos.nikolic83@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I’m facing the same (or similar) problem. In my case, the last two tasks
>>> hang in a map function following sc.sequenceFile(…). It happens from time
>>> to time (more often with TorrentBroadcast than HttpBroadcast) and after
>>> restarting it works fine.
>>>
>>> The problem always happens on the same node — on the node that plays the
>>> roles of the master and one worker. Once I made this node master-only
>>> (i.e., removed it from conf/slaves), the problem went away.
>>>
>>> Does that mean that the master and workers have to be on separate nodes?
>>>
>>> Best,
>>> Milos
>>>
>>>
>>> On Jan 6, 2014, at 5:44 PM, Grega Kešpret <grega@celtra.com> wrote:
>>>
>>> Hi,
>>>
>>> several times a day we see one worker in a Standalone cluster hang at
>>> 100% CPU on the last task and make no further progress. After we
>>> restart the job, it completes successfully.
>>>
>>> We are using Spark v0.8.1-incubating.
>>>
>>> Attached please find jstack logs of Worker
>>> and CoarseGrainedExecutorBackend JVM processes.
>>>
>>> Grega
>>> --
>>> *Grega Kešpret*
>>> Analytics engineer
>>>
>>> Celtra — Rich Media Mobile Advertising
>>> celtra.com <http://www.celtra.com/> | @celtramobile<http://www.twitter.com/celtramobile>
>>>  <logs.zip>
>>>
>>>
>>>
>>
>
