hadoop-common-issues mailing list archives

From "Duo Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13433) Race in UGI.reloginFromKeytab
Date Wed, 11 Jan 2017 06:07:58 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15817321#comment-15817321 ]

Duo Zhang commented on HADOOP-13433:

Oh, TestRaceWhenRelogin. The code I put here is used to verify that the problem can happen
without moving the tickets manually, so TestRaceWhenRelogin will pass without the fix...

In the new patch it is used to verify that the TGT is always the first ticket after relogin.
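
As a rough illustration of that kind of check (not the code from the patch), the sketch below walks a Subject's private credentials in iteration order and reports whether the first live KerberosTicket is a TGT. The class name `FirstTicketCheck` and both helper methods are hypothetical, and the sketch assumes that the private credential set iterates in insertion order, as the issue description implies.

```java
import java.util.Optional;
import java.util.Set;
import javax.security.auth.Subject;
import javax.security.auth.kerberos.KerberosTicket;

// Hypothetical helper, not part of Hadoop or of the attached patches.
public class FirstTicketCheck {

    // A TGT is issued for the ticket-granting service, whose principal starts with "krbtgt/".
    static boolean isTgtServerName(String serverPrincipal) {
        return serverPrincipal.startsWith("krbtgt/");
    }

    // Returns empty if the subject holds no live Kerberos ticket, otherwise whether
    // the first live ticket in iteration order is a TGT.
    static Optional<Boolean> firstTicketIsTgt(Subject subject) {
        Set<Object> creds = subject.getPrivateCredentials();
        synchronized (creds) {
            for (Object cred : creds) {
                if (!(cred instanceof KerberosTicket)) {
                    continue;
                }
                KerberosTicket ticket = (KerberosTicket) cred;
                if (ticket.isDestroyed()) {
                    continue; // a destroyed ticket no longer exposes its server principal
                }
                return Optional.of(isTgtServerName(ticket.getServer().getName()));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // An empty Subject has no tickets, so there is nothing to check.
        System.out.println(firstTicketIsTgt(new Subject()));
    }
}
```

A test along the lines described above would presumably call such a check on the login Subject after each relogin and fail if it ever reports a non-TGT first ticket.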

And for {{CommonConfigurationKeys.HADOOP_KERBEROS_MIN_SECONDS_BEFORE_RELOGIN}}, yeah we have
shouldRenewImmediatelyForTests now so we do not need to set it anymore.


> Race in UGI.reloginFromKeytab
> -----------------------------
>                 Key: HADOOP-13433
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13433
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: security
>    Affects Versions: 2.8.0, 2.7.3, 2.6.5, 3.0.0-alpha1
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>             Fix For: 2.8.0, 2.7.4, 3.0.0-alpha2, 2.6.6
>         Attachments: HADOOP-13433-v1.patch, HADOOP-13433-v2.patch, HADOOP-13433-v4.patch,
HADOOP-13433.patch, HBASE-13433-testcase-v3.patch
> This is a problem that has troubled us for several years. For our HBase cluster, sometimes
the RS will be stuck due to
> {noformat}
> 2016-06-20,03:44:12,936 INFO org.apache.hadoop.ipc.SecureClient: Exception encountered
while connecting to the server :
> javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid
credentials provided (Mechanism level: The ticket isn't for us (35) - BAD TGS SERVER NAME)]
>         at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:194)
>         at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:140)
>         at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.setupSaslConnection(SecureClient.java:187)
>         at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.access$700(SecureClient.java:95)
>         at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection$2.run(SecureClient.java:325)
>         at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection$2.run(SecureClient.java:322)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1781)
>         at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.util.Methods.call(Methods.java:37)
>         at org.apache.hadoop.hbase.security.User.call(User.java:607)
>         at org.apache.hadoop.hbase.security.User.access$700(User.java:51)
>         at org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:461)
>         at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.setupIOstreams(SecureClient.java:321)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1164)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1004)
>         at org.apache.hadoop.hbase.ipc.SecureRpcEngine$Invoker.invoke(SecureRpcEngine.java:107)
>         at $Proxy24.replicateLogEntries(Unknown Source)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:962)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.runLoop(ReplicationSource.java:466)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:515)
> Caused by: GSSException: No valid credentials provided (Mechanism level: The ticket isn't
for us (35) - BAD TGS SERVER NAME)
>         at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:663)
>         at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:248)
>         at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:180)
>         at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:175)
>         ... 23 more
> Caused by: KrbException: The ticket isn't for us (35) - BAD TGS SERVER NAME
>         at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:64)
>         at sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:185)
>         at sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:294)
>         at sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:106)
>         at sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:557)
>         at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:594)
>         ... 26 more
> Caused by: KrbException: Identifier doesn't match expected value (906)
>         at sun.security.krb5.internal.KDCRep.init(KDCRep.java:133)
>         at sun.security.krb5.internal.TGSRep.init(TGSRep.java:58)
>         at sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:53)
>         at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:46)
>         ... 31 more
> {noformat}
> It rarely happens, but when it does, the regionserver gets stuck and can never recover.
> Recently we added a log after a successful re-login which prints the private credentials,
and finally caught the direct cause. After a successful re-login, we have two Kerberos tickets
in the credentials: one is the TGT, and the other is a service ticket. The strange thing is
that the service ticket is placed before the TGT. This breaks an assumption of the JDK's Kerberos
library. See http://hg.openjdk.java.net/jdk8u/jdk8u60/jdk/file/935758609767/src/share/classes/sun/security/jgss/krb5/Krb5InitCredential.java,
the {{getTgt}} method:
> {code:title=Krb5InitCredential}
>             return AccessController.doPrivileged(
>                 new PrivilegedExceptionAction<KerberosTicket>() {
>                 public KerberosTicket run() throws Exception {
>                     // It's OK to use null as serverPrincipal. TGT is almost
>                     // the first ticket for a principal and we use list.
>                     return Krb5Util.getTicket(
>                         realCaller,
>                         clientPrincipal, null, acc);
>                         }});
> {code}
> So here, the library will use the service ticket as the TGT to acquire a service ticket,
and the KDC will reject the request since the server principal of the 'TGT' does not start with
'krbtgt'. And it can never recover because in UGI, the re-login first checks whether there is
a valid TGT and, no doubt, we have one...
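
The stuck state described above can be modelled in a few lines. This is only a sketch under stated assumptions: tickets are represented by their server principal names as plain strings, `firstTicket` stands in for the JDK lookup with a null server principal, and `hasTgt` stands in for a relogin guard that only asks whether some TGT is present. The class, both method names, and the example principals are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the stuck state; it uses no real Kerberos classes.
public class StuckStateSketch {

    // Models a lookup with a null server principal: the first ticket wins.
    static String firstTicket(List<String> serverNames) {
        return serverNames.isEmpty() ? null : serverNames.get(0);
    }

    // Models a relogin guard that scans for any TGT, wherever it sits in the list.
    static boolean hasTgt(List<String> serverNames) {
        for (String name : serverNames) {
            if (name.startsWith("krbtgt/")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Credential order observed after the bad re-login: service ticket first.
        List<String> creds = Arrays.asList(
            "hbase/rs1.example.com@EXAMPLE.COM",
            "krbtgt/EXAMPLE.COM@EXAMPLE.COM");

        String usedAsTgt = firstTicket(creds);
        System.out.println("used as TGT: " + usedAsTgt);
        System.out.println("really a TGT: " + usedAsTgt.startsWith("krbtgt/"));
        System.out.println("relogin guard sees a TGT: " + hasTgt(creds));
    }
}
```

The two checks disagree: the lookup hands back the service ticket, while the guard is satisfied that a TGT exists, so nothing triggers a fresh login that could repair the order.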
> This usually happens when a secure connection initialization coincides with the re-login,
and the end time indicates that the service ticket was acquired using the previous TGT. Since
UGI does not prevent doAs and re-login from happening at the same time, we believe that there is a
race condition.
> After reading the code, we found a possible race condition.
> See http://hg.openjdk.java.net/jdk8u/jdk8u60/jdk/file/935758609767/src/share/classes/sun/security/jgss/krb5/Krb5Context.java,
the {{initSecContext}} method: we get the TGT first, then check whether there is already a service
ticket and, if not, acquire a service ticket using the TGT and put it into the credentials.
> And in Krb5LoginModule.logout (the Sun version), we remove the Kerberos tickets from
the credentials first, and then destroy them.
> Here comes the race condition. Let T1 be the secure connection setup thread and T2 be the
re-login thread.
> T1: get TGT
> T2: remove all tickets from credentials
> T1: check service ticket, none(since all tickets have been removed)
> T1: acquire a new service ticket using TGT and put it into the credentials
> T2: destroy all tickets
> T2: login, i.e., put a new TGT into the credentials.
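
The interleaving above can be replayed deterministically on a single thread to see the resulting ticket order. This is a minimal sketch, not a reproduction against the real JDK classes: the credentials are modelled as a list of strings, and the class name, method name, and principal names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-threaded replay of the T1/T2 interleaving.
public class ReloginRaceSketch {

    // Performs the six steps in order and returns the final credential list.
    static List<String> replay() {
        List<String> creds = new ArrayList<>();
        creds.add("krbtgt/EXAMPLE.COM (old TGT)");

        // T1: get TGT
        String tgtSeenByT1 = creds.get(0);

        // T2: remove all tickets from credentials
        List<String> removed = new ArrayList<>(creds);
        creds.clear();

        // T1: check service ticket, none (since all tickets have been removed)
        boolean hasServiceTicket = false;
        for (String ticket : creds) {
            if (ticket.startsWith("hbase/")) {
                hasServiceTicket = true;
            }
        }

        // T1: acquire a new service ticket using the TGT and put it into the credentials
        if (!hasServiceTicket) {
            creds.add("hbase/rs1.example.com (issued via " + tgtSeenByT1 + ")");
        }

        // T2: destroy all tickets (only the ones it removed earlier)
        removed.clear();

        // T2: login, i.e., put a new TGT into the credentials
        creds.add("krbtgt/EXAMPLE.COM (new TGT)");
        return creds;
    }

    public static void main(String[] args) {
        List<String> creds = replay();
        System.out.println(creds);
        System.out.println("first ticket is a TGT: " + creds.get(0).startsWith("krbtgt/"));
    }
}
```

In this model the service ticket obtained with the old TGT survives the logout because it was added after T2 had already emptied the list, and the new TGT lands behind it.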
> It is hard to write a UT to reproduce the problem because the racing code is in the JDK, which
is not written by us...
> Suggestions are welcome. Thanks.

