spark-user mailing list archives

From Imran Rashid <iras...@cloudera.com>
Subject Re: Spark runs into an Infinite loop even if the tasks are completed successfully
Date Thu, 13 Aug 2015 18:07:15 GMT
Oh I see, you are defining your own RDD & Partition types, and you had a
bug where partition.index did not line up with the partition's slot in
the array returned by rdd.getPartitions.  Is that correct?
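
To make that invariant concrete, here is a minimal sketch in Scala. It
assumes only what this thread converges on: the element at slot i of the
array returned by getPartitions should report index == i. The Partition
trait below is a local stand-in mirroring the shape of
org.apache.spark.Partition so the snippet runs without Spark on the
classpath. ThriftPartition, misalignedSlots and alignedPartitions are
hypothetical names, not part of Spark or of the original job.

```scala
// Local stand-in for org.apache.spark.Partition (assumption: only `index` matters here).
trait Partition extends Serializable { def index: Int }

// Hypothetical partition type for illustration; not the class from the thread.
final case class ThriftPartition(index: Int) extends Partition

object PartitionIndexCheck {
  // Returns every slot whose partition.index differs from its array position.
  def misalignedSlots(parts: Array[Partition]): Seq[Int] =
    parts.indices.filter(slot => parts(slot).index != slot)

  // Builds n partitions whose index equals their slot: 0, 1, ..., n - 1.
  def alignedPartitions(n: Int): Array[Partition] =
    Array.tabulate[Partition](n)(i => ThriftPartition(i))

  def main(args: Array[String]): Unit = {
    // Mimics the reported bug: ids run 1..4 twice, so they start at 1 and repeat.
    val broken: Array[Partition] =
      Array.tabulate[Partition](8)(i => ThriftPartition(i % 4 + 1))

    println(s"broken layout, misaligned slots:  ${misalignedSlots(broken)}")
    println(s"aligned layout, misaligned slots: ${misalignedSlots(alignedPartitions(8))}")
  }
}
```

A check like misalignedSlots could be called at the end of a custom
getPartitions during development, for example inside a require(...), so
that a bad layout fails fast instead of surfacing later in the scheduler.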

On Thu, Aug 13, 2015 at 2:40 AM, Akhil Das <akhil@sigmoidanalytics.com>
wrote:

> I figured that out, and these are my findings:
>
> -> It enters an infinite loop when there's a duplicate partition
> id.
>
> -> It enters an infinite loop when the partition ids start from 1
> rather than 0.
>
>
> Something like this piece of code, placed in getPartitions(), can reproduce it:
>
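> val total_partitions = 4
> // Two passes of total_partitions ids each, so the array needs
> // 2 * total_partitions slots (with only 4 the loop would overflow it).
> val partitionsArray: Array[Partition] =
>   Array.ofDim[Partition](2 * total_partitions)
>
> var i = 0
>
> // Deliberately wrong: ids run 1..4 twice, so they start at 1 and repeat.
> for (outer <- 0 to 1) {
>   for (partition <- 1 to total_partitions) {
>     partitionsArray(i) = new DeadLockPartitions(partition)
>     i = i + 1
>   }
> }
>
> partitionsArray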
>
>
>
>
> Thanks
> Best Regards
>
> On Wed, Aug 12, 2015 at 10:57 PM, Imran Rashid <irashid@cloudera.com>
> wrote:
>
>> Yikes.
>>
>> Was this a one-time thing?  Or does it happen consistently?  Can you turn
>> on debug logging for org.apache.spark.scheduler (dunno if it will help,
>> but maybe ...)
>>
>> On Tue, Aug 11, 2015 at 8:59 AM, Akhil Das <akhil@sigmoidanalytics.com>
>> wrote:
>>
>>> Hi
>>>
>>> My Spark job (running in local[*] with Spark 1.4.1) reads data from a
>>> Thrift server. I created an RDD that computes its partitions in
>>> getPartitions(), and in compute() the iterator's hasNext returns records
>>> from these partitions. count() and foreach() work fine and return the
>>> correct number of records. But whenever there is a ShuffleMapStage (from
>>> reduceByKey etc.), all the tasks execute properly and then the job enters
>>> an infinite loop, logging:
>>>
>>>
>>>    15/08/11 13:05:54 INFO DAGScheduler: Resubmitting ShuffleMapStage 1 (map at FilterMain.scala:59) because some of its tasks had failed: 0, 3
>>>
>>>
>>> Here's the complete stack-trace http://pastebin.com/hyK7cG8S
>>>
>>> What could be the root cause of this problem? I searched and came
>>> across this closed JIRA <https://issues.apache.org/jira/browse/SPARK-583>
>>> (which is very old).
>>>
>>>
>>>
>>>
>>> Thanks
>>> Best Regards
>>>
>>
>>
>
