From "Chaoyu Tang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-16071) Spark remote driver misuses the timeout in RPC handshake
Date Fri, 03 Mar 2017 04:04:45 GMT

    [ https://issues.apache.org/jira/browse/HIVE-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893684#comment-15893684 ]

Chaoyu Tang edited comment on HIVE-16071 at 3/3/17 4:04 AM:
------------------------------------------------------------

For the error I mentioned in this JIRA description, it was caused by the [timeout|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java#L176] set in HS2's registerClient. It happened while SaslServerHandler and SaslClientHandler were still handshaking. The timeout caused registerClient in SparkClientImpl to throw an error, which interrupted the process calling spark-submit and thereby closed the channel between HS2 and the RemoteDriver. The channel termination was detected by SaslClientHandler.channelInactive on the RemoteDriver side, which in turn invoked SaslClientHandler.dispose(). That is why we saw the SASLException with the message "SaslException: Client closed before SASL negotiation finished." I have managed to reproduce this error by adjusting the hive.spark.client.server.connect.timeout value so that the HS2 timeout fires during SASL negotiation, before RemoteDriver reaches its own timeout.
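
For reference, this is roughly the knob I adjusted to reproduce it; the 2000ms value below is only illustrative, it just has to be short enough to expire while the SASL negotiation is still in flight:

{code}
-- illustrative value, not a recommendation: short enough to fire during
-- the SASL handshake (set in hive-site.xml or at the session level)
SET hive.spark.client.server.connect.timeout=2000ms;
{code}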

Looking more into the code, I think the [cancelTask|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java#L100] is effectively unused: it never gets a chance to take effect, because it uses the same timeout value as [registerClient|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java#L170], and the latter always kicks in first.
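
To illustrate why (a self-contained sketch, not Hive code): two timers armed with the same delay always fire in the order they were armed, and registerClient arms its timeout before the client even connects, i.e. before cancelTask is armed:

{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SameDelayRace {
  public static void main(String[] args) throws InterruptedException {
    ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    long timeoutMs = 1000; // stands in for hive.spark.client.server.connect.timeout

    // registerClient arms its timeout first, before any client has connected
    timer.schedule(() -> System.out.println("registerClient timeout fired"),
        timeoutMs, TimeUnit.MILLISECONDS);

    Thread.sleep(200); // ...the client connects a little later...

    // ...and only then is cancelTask armed, with the SAME delay, so it can
    // never fire before the registerClient timeout has already aborted things
    timer.schedule(() -> System.out.println("cancelTask fired (too late)"),
        timeoutMs, TimeUnit.MILLISECONDS);

    Thread.sleep(1500); // let both timers elapse, then clean up
    timer.shutdownNow();
  }
}
{code}
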
A timeout on the RemoteDriver side can happen in two places. If it happens while the driver [connects|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/Rpc.java#L110] back to HS2, HS2 cannot detect the timeout error on the driver side and has to wait until its own timeout set in [registerClient|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java#L170] takes effect. If the [timeout|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/Rpc.java#L122] happens during the SASL handshake, the RemoteDriver main will exit abnormally. SaslServerHandler.channelInactive on the HS2 side can detect this channel termination and invokes SaslServerHandler.dispose, which in turn cancels the cancelTask (so it is never used). Depending on the stage HS2 is at (see the following code snippet),
{code}
    protected void onError(Throwable error) {
      // stop the cancelTask scheduled for this channel
      cancelTask.cancel(true);
      if (client != null) {
        // a registered client exists: cancel its timeout and fail its promise right away
        client.timeoutFuture.cancel(true);
        if (!client.promise.isDone()) {
          client.promise.setFailure(error);
        }
      }
      // when client is null, nothing is failed here, so HS2 keeps waiting
      // until hive.spark.client.server.connect.timeout expires
    }
{code}
HS2 either has to wait until its hive.spark.client.server.connect.timeout expires (when clientInfo is null) or terminates the process immediately.
So the [cancelTask|https://github.com/apache/hive/blob/master/spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java#L100] is currently useless unless its timeout can be set to a different (shorter) value than the one used for registerClient (hive.spark.client.server.connect.timeout). Alternatively, we could consider removing it, though its presence in the current code does no harm either.
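
If we keep it, a sketch of what giving it its own deadline could look like (the Netty scheduling mirrors the pattern used in RpcServer, but the halved value is a hypothetical choice, not an existing config):

{code}
import io.netty.channel.Channel;
import io.netty.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class HandshakeCancelTask {
  // Today cancelTask shares hive.spark.client.server.connect.timeout with
  // registerClient; a shorter, dedicated deadline would let it fire first.
  static ScheduledFuture<?> schedule(Channel channel, long serverConnectTimeoutMs) {
    long cancelTaskTimeoutMs = serverConnectTimeoutMs / 2; // hypothetical shorter value
    return channel.eventLoop().schedule(
        () -> channel.close(), // abort the half-finished handshake
        cancelTaskTimeoutMs, TimeUnit.MILLISECONDS);
  }
}
{code}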



> Spark remote driver misuses the timeout in RPC handshake
> --------------------------------------------------------
>
>                 Key: HIVE-16071
>                 URL: https://issues.apache.org/jira/browse/HIVE-16071
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Chaoyu Tang
>            Assignee: Chaoyu Tang
>         Attachments: HIVE-16071.patch
>
>
> Based on its property description in HiveConf and the comments in HIVE-12650 (https://issues.apache.org/jira/browse/HIVE-12650?focusedCommentId=15128979&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15128979), hive.spark.client.connect.timeout is the timeout for the Spark remote driver to make a socket connection (channel) to the RPC server. But currently it is also used by the remote driver for the RPC client/server handshake, which is not right. Instead, hive.spark.client.server.connect.timeout should be used there, as it already is by the RpcServer during the handshake.
> An error like the following is usually caused by this issue, since the default hive.spark.client.connect.timeout value (1000 ms) used by the remote driver for the handshake is a little too short.
> {code}
> 17/02/20 08:46:08 ERROR yarn.ApplicationMaster: User class threw exception: java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: Client closed before SASL negotiation finished.
> java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: Client closed before SASL negotiation finished.
>         at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
>         at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:156)
>         at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)
> Caused by: javax.security.sasl.SaslException: Client closed before SASL negotiation finished.
>         at org.apache.hive.spark.client.rpc.Rpc$SaslClientHandler.dispose(Rpc.java:453)
>         at org.apache.hive.spark.client.rpc.SaslHandler.channelInactive(SaslHandler.java:90)
> {code}
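
A minimal sketch of the separation the description calls for, written directly against Netty (the class and method names here are illustrative, not the actual Rpc.java code):

{code}
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelOption;
import java.util.concurrent.TimeUnit;

class DriverTimeouts {
  // hive.spark.client.connect.timeout should only bound the socket connect...
  static ChannelFuture connect(Bootstrap bootstrap, String host, int port,
      int connectTimeoutMs) {
    bootstrap.option(ChannelOption.CONNECT_TIMEOUT_MILLIS, connectTimeoutMs);
    return bootstrap.connect(host, port);
  }

  // ...while the SASL negotiation should be bounded separately by the longer
  // hive.spark.client.server.connect.timeout, as RpcServer already does.
  static void boundHandshake(Channel channel, long serverConnectTimeoutMs) {
    channel.eventLoop().schedule(
        () -> channel.close(), // give up if negotiation never finishes
        serverConnectTimeoutMs, TimeUnit.MILLISECONDS);
  }
}
{code}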



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
