spark-user mailing list archives

From Jeff Evans <jeffrey.wayne.ev...@gmail.com>
Subject Re: Possible to limit number of IPC retries on spark-submit?
Date Fri, 31 Jan 2020 21:53:24 GMT
Figured out the answer, eventually.  The magic property name, in this case,
is yarn.client.failover-max-attempts (prefixed with spark.hadoop. in the
case of Spark, of course).  For a full explanation, see the StackOverflow
answer <https://stackoverflow.com/a/60011708/375670> I just added.
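
For reference, a spark-submit invocation setting this property might look like the sketch below. The master URL, jar path, and argument are illustrative placeholders; the property itself is a Hadoop/YARN client setting that Spark forwards into the Hadoop Configuration via the spark.hadoop. prefix:

```shell
# Cap YARN ResourceManager failover attempts at 2, instead of the 30
# attempts observed in the original report, so a submit that cannot
# authenticate fails fast rather than retrying for many minutes.
# (Jar path and master are placeholders for your environment.)
spark-submit \
  --master yarn \
  --conf spark.hadoop.yarn.client.failover-max-attempts=2 \
  --class org.apache.spark.examples.SparkPi \
  /path/to/spark-examples.jar 100
```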

On Wed, Jan 22, 2020 at 5:02 PM Jeff Evans <jeffrey.wayne.evans@gmail.com>
wrote:

> Greetings,
>
> Is it possible to limit the number of times the IPC client retries upon a
> spark-submit invocation?  For context, see this StackOverflow post
> <https://stackoverflow.com/questions/59863850/how-to-control-the-number-of-hadoop-ipc-retry-attempts-for-a-spark-job-submissio>.
> In essence, I am trying to call spark-submit on a Kerberized cluster,
> without having valid Kerberos tickets available.  This is deliberate, and
> I'm not truly facing a Kerberos issue.  Rather, this is the
> easiest reproducible case of "long IPC retry" I have been able to trigger.
>
> In this particular case, the following errors are printed (presumably by
> the launcher):
>
> 20/01/22 15:49:32 INFO retry.RetryInvocationHandler: java.io.IOException: Failed on local
> exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client
> cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "node-1.cluster/172.18.0.2";
> destination host is: "node-1.cluster":8032; , while invoking ApplicationClientProtocolPBClientImpl.getClusterMetrics
> over null after 1 failover attempts. Trying to failover after sleeping for 35160ms.
>
> This repeats 30 times before the launcher finally gives up.
>
> As indicated in the answer on that StackOverflow post, the relevant Hadoop
> properties should be ipc.client.connect.max.retries and/or
> ipc.client.connect.max.retries.on.sasl.  However, in testing on Spark
> 2.4.0 (on CDH 6.1), I am not able to get either of these to take effect (it
> still retries 30 times regardless).  I am trying the SparkPi example, and
> specifying them with --conf spark.hadoop.ipc.client.connect.max.retries
> and/or --conf spark.hadoop.ipc.client.connect.max.retries.on.sasl.
>
> Any ideas on what I could be doing wrong, or why I can't get these
> properties to take effect?
>
