Thanks for following up and explaining this one! Definitely something other users might run into...


On Thu, Jan 16, 2014 at 5:58 AM, Grega Kešpret <grega@celtra.com> wrote:
Just to follow up, we have since pinpointed the problem to be in our application code (not Spark). In some cases, there was an infinite loop in Scala's HashTable linear probing algorithm, where an element's next() pointed at itself. It was probably caused by incorrect hashCode() and equals() methods on the object we were storing.
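
For anyone else hitting this, here is a minimal sketch (a made-up EventKey class, not our actual code) of the kind of equals()/hashCode() mismatch that can break Scala's mutable hash tables:

    import scala.collection.mutable

    // Hypothetical key class (not our real one): equals() is overridden but
    // hashCode() is not, which violates the contract that HashMap relies on.
    class EventKey(val id: String) {
      override def equals(other: Any): Boolean = other match {
        case that: EventKey => this.id == that.id
        case _              => false
      }
      // BUG: hashCode() is still the default identity hash, so two "equal"
      // keys usually land in different buckets.
    }

    object HashCodeContractDemo {
      def main(args: Array[String]): Unit = {
        val counts = mutable.HashMap.empty[EventKey, Int]
        counts(new EventKey("a")) = 1

        // This lookup typically misses because the second key hashes to a
        // different bucket; with heavier use, broken equals()/hashCode()
        // can corrupt the table's internal structure in worse ways.
        println(counts.get(new EventKey("a")))
      }
    }

Making hashCode() and equals() consistent (e.g. deriving both from the same fields, or using a case class) avoids this class of problems.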

Milos, we also have the Master node separate from the Worker nodes. Could someone from the Spark team comment on that?

Grega
--
Grega Kešpret
Analytics engineer

Celtra — Rich Media Mobile Advertising
celtra.com | @celtramobile


On Thu, Jan 16, 2014 at 2:46 PM, Milos Nikolic <milos.nikolic83@gmail.com> wrote:
Hello,

I’m facing the same (or a similar) problem. In my case, the last two tasks hang in a map function following sc.sequenceFile(…). It happens from time to time (more often with TorrentBroadcast than with HttpBroadcast), and after restarting it works fine.
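
For reference, here is a stripped-down sketch of the shape of the job (the master URL, input path, and Writable types are placeholders, not my real job):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.hadoop.io.{LongWritable, Text}

    object SequenceFileMapSketch {
      def main(args: Array[String]): Unit = {
        // Switch between the two broadcast implementations; the hang shows up
        // more often for me with TorrentBroadcastFactory than with this one.
        System.setProperty("spark.broadcast.factory",
          "org.apache.spark.broadcast.HttpBroadcastFactory")

        // Placeholder master URL and app name.
        val sc = new SparkContext("spark://master:7077", "SequenceFileMapSketch")

        // sc.sequenceFile(...) followed by a map -- the last two tasks of this
        // stage occasionally sit at 100% CPU and never finish.
        val records = sc.sequenceFile[LongWritable, Text]("hdfs:///path/to/input")
        val lengths = records.map { case (_, value) => value.toString.length }
        println(lengths.count())

        sc.stop()
      }
    }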

The problem always happens on the same node, the one that plays the roles of both the master and a worker. Once this node becomes master-only (i.e., after I removed it from conf/slaves), the problem is gone.

Does that mean that the master and workers have to be on separate nodes? 

Best,
Milos


On Jan 6, 2014, at 5:44 PM, Grega Kešpret <grega@celtra.com> wrote:

Hi,

several times a day we are seeing one worker in a Standalone cluster hang at 100% CPU on the last task and make no further progress. After we restart the job, it completes successfully.

We are using Spark v0.8.1-incubating.

Attached please find jstack logs of the Worker and CoarseGrainedExecutorBackend JVM processes.

Grega
--
Grega Kešpret
Analytics engineer

Celtra — Rich Media Mobile Advertising
celtra.com | @celtramobile
<logs.zip>