tez-dev mailing list archives

From "Sungwoo (Jira)" <j...@apache.org>
Subject [jira] [Created] (TEZ-4334) Fix deadlock in ShuffleScheduler
Date Fri, 03 Sep 2021 03:20:00 GMT
Sungwoo created TEZ-4334:
----------------------------

             Summary: Fix deadlock in ShuffleScheduler
                 Key: TEZ-4334
                 URL: https://issues.apache.org/jira/browse/TEZ-4334
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Sungwoo


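A deadlock can occur between a thread calling ShuffleScheduler.close() and the ShufflePenaltyReferee
thread. In the trace below, the Fetcher thread waits in close() while holding the ShuffleScheduler
monitor acquired in copyFailed(), and the referee thread is blocked waiting for that same monitor.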

Example (produced with an earlier version):

{code}
"Fetcher_O { attempt_1611850856294_0026_1_03_000000_0_10344 Reducer_3} #13" #2669 daemon prio=5
os_prio=0 tid=0x00002b9de869d000 nid=0xf99 in Object.wait() [0x00002b9de4983000]
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.close(ShuffleScheduler.java:481)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.cleanupShuffleScheduler(Shuffle.java:352)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.cleanupShuffleSchedulerIgnoreErrors(Shuffle.java:343)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.reportException(Shuffle.java:407)
        - locked <0x00002b96bbb9d7a8> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:1033)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:781)
        - locked <0x00002b96b98a7860> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:414)

"ShufflePenaltyReferee {Reducer_3}" #2645 daemon prio=5 os_prio=0 tid=0x00002b9560fae800 nid=0xf7d
waiting for monitor entry [0x00002b9de733b000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler$Referee.run(ShuffleScheduler.java:1322)
        - waiting to lock <0x00002b96b98a7860> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)
{code}

We can fix the deadlock by:

1) not holding the ShuffleScheduler.this lock when calling exceptionReporter.reportException()
2) removing synchronized from copyFailed()
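
The two fix items above can be illustrated with a minimal, self-contained sketch. This is not Tez code: SchedulerSketch, DeadlockSketchDemo, and every method and field in them are hypothetical stand-ins, and the sketch assumes that close() interrupts the referee thread and then joins it. The join is bounded at 500 ms only so that the stuck case can be observed instead of hanging.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for ShuffleScheduler; none of these names are Tez APIs.
class SchedulerSketch {
    private final boolean holdLockWhileReporting;
    private final CountDownLatch failureSeen = new CountDownLatch(1);
    private final Thread referee = new Thread(this::refereeLoop, "RefereeSketch");
    private int failures;
    private int penaltyPasses;

    SchedulerSketch(boolean holdLockWhileReporting) {
        this.holdLockWhileReporting = holdLockWhileReporting;
    }

    void start() {
        referee.setDaemon(true);
        referee.start();
    }

    // Stand-in for the referee thread: it needs the scheduler monitor to do its work.
    private void refereeLoop() {
        try {
            failureSeen.await();
        } catch (InterruptedException e) {
            return; // closed before any failure was seen
        }
        synchronized (this) {
            penaltyPasses++;
        }
    }

    // Returns true if close() managed to stop the referee thread.
    boolean copyFailed() throws InterruptedException {
        if (holdLockWhileReporting) {
            // Pattern seen in the trace: the monitor is still held while reporting.
            synchronized (this) {
                failures++;
                return reportException();
            }
        }
        // Proposed pattern: update state under the lock, report after releasing it.
        synchronized (this) {
            failures++;
        }
        return reportException();
    }

    private boolean reportException() throws InterruptedException {
        failureSeen.countDown();
        // Demo only: wait until the referee is blocked on the monitor or has finished.
        while (referee.getState() != Thread.State.BLOCKED
                && referee.getState() != Thread.State.TERMINATED) {
            Thread.sleep(1);
        }
        return close();
    }

    // Assumption: close() interrupts the referee and waits for it to exit.
    private boolean close() throws InterruptedException {
        referee.interrupt(); // has no effect on a thread blocked on a monitor
        referee.join(500);   // bounded for the demo; an unbounded join would hang
        return !referee.isAlive();
    }
}

class DeadlockSketchDemo {
    static boolean run(boolean holdLockWhileReporting) {
        SchedulerSketch scheduler = new SchedulerSketch(holdLockWhileReporting);
        scheduler.start();
        try {
            return scheduler.copyFailed();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("lock held while reporting, referee stopped: " + run(true));
        System.out.println("lock released before reporting, referee stopped: " + run(false));
    }
}
```

Under these assumptions, run(true) should return false, because the referee stays blocked on the monitor for as long as the reporting thread holds it. run(false) should return true, because the referee can take the monitor and exit before close() joins it.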




--
This message was sent by Atlassian Jira
(v8.3.4#803005)