spark-user mailing list archives

From Shivani Rao <raoshiv...@gmail.com>
Subject Re: Hanging Spark jobs
Date Thu, 12 Jun 2014 17:04:26 GMT
I learned this from a co-worker, and it is relevant here.

Spark evaluates transformations lazily by default, which means that none of your
code actually executes until an action such as "saveAsTextFile" runs. So the job
hanging at "saveAsTextFile" does not tell you much about where the problem is
actually occurring. In order to debug this better, you might want to put in a
"saveAsTextFile" (or another action) after each RDD operation, so that you can
figure out which step is getting stuck.
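The laziness itself is easy to see in miniature with plain Scala collection
views — a rough analogy, not Spark itself, and the names below are made up for
illustration:

```scala
// A collection view is lazy the way RDD transformations are:
// nothing is computed until something forces the pipeline.
object LazyDemo {
  def run(): (Int, Int) = {
    var evaluated = 0
    // Lazy "transformation": the map body has not run yet.
    val pipeline = (1 to 5).view.map { x => evaluated += 1; x * 2 }
    val before = evaluated // still 0 — no element has been evaluated
    pipeline.toList        // forcing the view plays the role of an action
    (before, evaluated)    // (0, 5)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

In a real job the same idea applies: force a cheap action (a count, or a
saveAsTextFile to a temp path) after each suspect transformation and watch
which one hangs.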

HTH
Shivani


On Wed, Jun 11, 2014 at 2:17 AM, Daniel Darabos <
daniel.darabos@lynxanalytics.com> wrote:

> These stack traces come from the stuck node? Looks like it's waiting on
> data in BlockFetcherIterator. Waiting for data from another node. But you
> say all other nodes were done? Very curious.
>
> Maybe you could try turning on debug logging, and try to figure out what
> happens in BlockFetcherIterator (
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockFetcherIterator.scala).
> I do not think it is supposed to get stuck indefinitely.
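> For what it's worth, Spark 0.9.x logs through log4j, so something along these
> lines in conf/log4j.properties should surface DEBUG output from the block
> fetching code (the exact logger name is an assumption on my part):
>
> ```properties
> # Enable DEBUG logging for the storage package, including BlockFetcherIterator
> log4j.logger.org.apache.spark.storage=DEBUG
> ```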
>
> On Tue, Jun 10, 2014 at 8:22 PM, Hurwitz, Daniel <dhurwitz@ebay.com>
> wrote:
>
>>  Hi,
>>
>>
>>
>> We are observing a recurring issue where our Spark jobs are hanging for
>> several hours, even days, until we kill them.
>>
>>
>>
>> We are running Spark v0.9.1 over YARN.
>>
>>
>>
>> Our input is a list of edges of a graph on which we use Bagel to compute
>> connected components using the following method:
>>
>>
>>
>> class CCMessage(var targetId: Long, var myComponentId: Long)
>>   extends Message[Long] with Serializable
>>
>> def compute(self: CC, msgs: Option[Array[CCMessage]], superstep: Int): (CC, Array[CCMessage]) = {
>>   val smallestComponentId = msgs.map(sq => sq.map(_.myComponentId).min).getOrElse(Long.MaxValue)
>>   val newComponentId = math.min(self.clusterID, smallestComponentId)
>>   val halt = (newComponentId == self.clusterID) || (superstep >= maxIters)
>>   self.active = if (superstep == 0) true else !halt
>>   val outGoingMessages =
>>     if (halt && superstep > 0) Array[CCMessage]()
>>     else self.edges.map(targetId => new CCMessage(targetId, newComponentId)).toArray
>>   self.clusterID = newComponentId
>>   (self, outGoingMessages)
>> }
>>
>>
>>
>> Our output is a text file in which each line is a list of the node IDs in
>> each component. The size of the output may be up to 6 GB.
>>
>>
>>
>> We see in the job tracker that jobs usually get stuck on the
>> “saveAsTextFile” command, the final line in our code. In some cases,
>> the job will hang during one of the iterations of Bagel during the
>> computation of the connected components.
>>
>>
>>
>> Oftentimes, when we kill the job and re-execute it, it will finish
>> successfully within an hour which is the expected duration. We notice that
>> if our Spark jobs don’t finish after a few hours, they will never finish
>> until they are killed, regardless of the load on our cluster.
>>
>>
>>
>> After consulting with our Hadoop support team, they noticed that after a
>> particular hanging Spark job had been running for 38 hours, all Spark
>> processes on all nodes had completed except for one node, which had been
>> running for more than 9 hours consuming very little CPU, with occasional
>> brief spikes (about 14 seconds of CPU) before going quiet again. Also, the
>> other nodes were not relinquishing their resources until our Hadoop admin
>> killed the process on that problematic node, at which point the job suddenly
>> finished and “success” was reported in the job tracker. The output seemed to
>> be fine too. In case it helps you understand the issue, the Hadoop admin
>> suggested this was a Spark issue and sent us two stack dumps, which I have
>> attached to this email: before killing the node’s Spark process (dump1.txt)
>> and after (dump2.txt).
>>
>>
>>
>> Any advice on how to resolve this issue? How can we debug this?
>>
>>  Thanks,
>>
>> ~Daniel
>>
>>
>>
>
>


-- 
Software Engineer
Analytics Engineering Team @ Box
Mountain View, CA
