spark-issues mailing list archives

From "Marcelo Vanzin (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-22218) spark shuffle services fails to update secret on application re-attempts
Date Mon, 09 Oct 2017 19:58:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-22218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin resolved SPARK-22218.
------------------------------------
       Resolution: Fixed
         Assignee: Thomas Graves
    Fix Version/s: 2.3.0
                   2.2.1

> spark shuffle services fails to update secret on application re-attempts
> ------------------------------------------------------------------------
>
>                 Key: SPARK-22218
>                 URL: https://issues.apache.org/jira/browse/SPARK-22218
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, YARN
>    Affects Versions: 2.2.1
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>            Priority: Blocker
>             Fix For: 2.2.1, 2.3.0
>
>
> Running on YARN, if an application has re-attempts while using the Spark 2.2 shuffle service, the external shuffle service does not update the secret properly, and the re-attempts fail with javax.security.sasl.SaslException.
> A bug fix in 2.2 (SPARK-21494) changed the ShuffleSecretManager to use containsKey (https://git.corp.yahoo.com/hadoop/spark/blob/yspark_2_2_0/common/network-shuffle/src/main/java/org/apache/spark/network/sasl/ShuffleSecretManager.java#L50), which is the proper behavior. The problem is that the key is never removed between application re-attempts, so when the second attempt starts, the code sees that it already contains the key (since the application id is the same) and does not update the secret.
> To reproduce this, run something like a word count with the output directory already existing. The first attempt fails because the output directory exists, and the subsequent attempts fail with the maximum number of executor failures. Note that this assumes the second and third attempts run on the same node as the first attempt.
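
For illustration, here is a minimal sketch of the failure mode described
above (hypothetical class and method names; the real logic lives in
org.apache.spark.network.sasl.ShuffleSecretManager). A registration guarded
by containsKey keeps the first attempt's secret when a re-attempt registers
under the same application id; unconditionally overwriting (or unregistering
the app when an attempt ends) avoids the stale secret.

    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical stand-in for the shuffle service's secret bookkeeping.
    public class SecretRegistry {
        private final ConcurrentHashMap<String, String> secrets =
            new ConcurrentHashMap<>();

        // Problematic pattern: the guard means a second attempt with the
        // same appId never replaces the first attempt's secret, so SASL
        // authentication of the new attempt fails.
        public void registerApp(String appId, String secret) {
            if (!secrets.containsKey(appId)) {
                secrets.put(appId, secret);
            }
        }

        // One way out: always overwrite, so the latest attempt's secret wins.
        public void registerAppOverwriting(String appId, String secret) {
            secrets.put(appId, secret);
        }

        public String getSecret(String appId) {
            return secrets.get(appId);
        }
    }

With the guarded version, registerApp("app_1", "s1") followed by
registerApp("app_1", "s2") leaves getSecret("app_1") returning "s1", which
is exactly the stale-secret behavior the re-attempts hit.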



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


