spark-issues mailing list archives

From "Josh Rosen (JIRA)" <>
Subject [jira] [Created] (SPARK-19529) TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()
Date Thu, 09 Feb 2017 07:25:41 GMT
Josh Rosen created SPARK-19529:

             Summary: TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()
                 Key: SPARK-19529
             Project: Spark
          Issue Type: Bug
          Components: Shuffle, Spark Core
    Affects Versions: 2.1.0, 2.0.0, 1.6.0
            Reporter: Josh Rosen
            Assignee: Josh Rosen

In Spark's Netty RPC layer, TransportClientFactory.createClient() calls awaitUninterruptibly()
on a Netty future while waiting for a connection to be established. This creates a problem when
a Spark task is interrupted while blocked in this call (which can happen with a slow connection
that will eventually time out): the interrupt is swallowed until the wait finishes, so task
cancellation does not take effect promptly when interruptOnCancel = true.

As an example of the impact of this problem, I experienced significant numbers of uncancellable
"zombie tasks" on a production cluster where several tasks were blocked trying to connect
to a dead shuffle server and then continued running as zombies after I cancelled the associated
Spark stage. The zombie tasks ran for several minutes with the following stack:

java.lang.Object.wait(Native Method)
=> holding Monitor(java.lang.Object@1849476028})$1.createAndStart(

I believe that we can easily fix this by using the InterruptedException-throwing await() instead.
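
To illustrate the difference in behavior, here is a minimal, self-contained sketch. It does not
use Netty or Spark code: a java.util.concurrent.CountDownLatch stands in for the connection
future, and the class and method names (InterruptibleConnectDemo, awaitUninterruptibly,
awaitInterruptibly, millisUntilExitAfterInterrupt) are hypothetical helpers written for this
example. The uninterruptible variant only mimics the general "swallow the interrupt and keep
waiting" semantics; it is not Netty's implementation.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class InterruptibleConnectDemo {

    // Stand-in for an uninterruptible wait: interrupts are remembered but the
    // wait continues until the latch opens or the timeout elapses.
    public static boolean awaitUninterruptibly(CountDownLatch connected, long timeoutMs) {
        boolean interrupted = false;
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        try {
            while (true) {
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) {
                    return connected.getCount() == 0;
                }
                try {
                    return connected.await(remaining, TimeUnit.NANOSECONDS);
                } catch (InterruptedException e) {
                    interrupted = true; // swallow the interrupt and keep waiting
                }
            }
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt(); // restore the flag on exit
            }
        }
    }

    // Stand-in for an interruptible wait: an interrupt ends the wait at once
    // by throwing InterruptedException to the caller.
    public static boolean awaitInterruptibly(CountDownLatch connected, long timeoutMs)
            throws InterruptedException {
        return connected.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    // Starts a "task" thread that waits on a connection that never completes,
    // interrupts it after about 50 ms, and returns how long the thread took to exit.
    public static long millisUntilExitAfterInterrupt(boolean interruptible, long timeoutMs)
            throws InterruptedException {
        CountDownLatch neverConnected = new CountDownLatch(1);
        Thread task = new Thread(() -> {
            try {
                if (interruptible) {
                    awaitInterruptibly(neverConnected, timeoutMs);
                } else {
                    awaitUninterruptibly(neverConnected, timeoutMs);
                }
            } catch (InterruptedException e) {
                // Cancellation observed: the task can stop here.
            }
        });
        long start = System.nanoTime();
        task.start();
        Thread.sleep(50);
        task.interrupt(); // simulates cancelling the task with interruptOnCancel = true
        task.join();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("uninterruptible wait: task exited after ~"
                + millisUntilExitAfterInterrupt(false, 1000) + " ms");
        System.out.println("interruptible wait:   task exited after ~"
                + millisUntilExitAfterInterrupt(true, 1000) + " ms");
    }
}
```

In this sketch the uninterruptible task keeps running until the full timeout has elapsed even
though it was interrupted early, while the interruptible task exits shortly after the interrupt.
With a connection timeout measured in minutes rather than one second, the first case would
correspond to the zombie tasks described above.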
