spark-user mailing list archives

From Matt Cheah <mch...@palantir.com>
Subject Re: Spark on K8s resource staging server timeout
Date Thu, 29 Mar 2018 20:26:42 GMT
Hello Jenna,

Are there any logs from the resource staging server pod? They might show something interesting.

Unfortunately, we haven't been maintaining the resource staging server because we've moved all of our effort to the main repository instead of the fork. When we consider the submission of local files in the official release, we should probably create a more resilient mechanism: a single HTTP server isn't ideal, and we would rather have something highly available, replicated, and so on.

-Matt Cheah

From: Jenna Hoole <jenna.hoole@gmail.com>
Date: Thursday, March 29, 2018 at 10:37 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Spark on K8s resource staging server timeout


I added overkill high timeouts to the OkHttpClient.Builder() in RetrofitClientFactory.scala and I don't seem to be timing out anymore:

val okHttpClientBuilder = new OkHttpClient.Builder()
      .dispatcher(dispatcher)
      .proxy(resolvedProxy)
      .connectTimeout(120, TimeUnit.SECONDS)
      .writeTimeout(120, TimeUnit.SECONDS)
      .readTimeout(120, TimeUnit.SECONDS)

-Jenna
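Patching the client timeouts works, but if the underlying problem is the staging server's slow start-up, a CI pipeline can also simply wait for the server to answer before the first spark-submit. Below is a rough sketch of such a readiness probe in shell; the function name `wait_for_staging` and the retry/delay defaults are made up for illustration (they are not anything Spark provides), and it assumes that any HTTP success from the staging endpoint means the server is ready:

```shell
#!/bin/sh
# Hypothetical readiness probe for the resource staging server.
# wait_for_staging URI [TRIES] [DELAY_SECONDS] polls URI until curl gets a
# successful response, retrying up to TRIES times with DELAY_SECONDS between
# attempts. Returns 0 once the server answers, 1 if the budget is exhausted.
wait_for_staging() {
  uri="$1"; tries="${2:-60}"; delay="${3:-5}"
  i=1
  while [ "$i" -le "$tries" ]; do
    # -f: treat HTTP errors as failure; --max-time bounds each probe attempt
    if curl -fsS --max-time 5 "$uri" >/dev/null 2>&1; then
      return 0                      # server responded
    fi
    [ "$i" -lt "$tries" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1                          # server never came up within the budget
}
```

It would be invoked as, e.g., `wait_for_staging http://192.168.0.1:30622 && bin/spark-submit ...`, using the NodePort URI from the transcript below. Recent curl versions can express a similar idea directly with --retry and --retry-connrefused.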


On Tue, Mar 27, 2018 at 10:48 AM, Jenna Hoole <jenna.hoole@gmail.com> wrote:

So I'm running into an issue with my resource staging server that's producing a stack trace like Issue 342, but I don't think for the same reasons. What's happening is that every time after I start up a resource staging server, the first job submitted that uses it will fail with a java.net.SocketTimeoutException: timeout, and then every subsequent job will run perfectly, including with different jars and different users. It's only ever the first job that fails, and it always fails. I know I'm also running into Issue 577, in that it takes about three minutes before the resource staging server is accessible, but I'm still failing after waiting over ten minutes, or in one case overnight. And I'm just using the examples jar, so it's not a super large jar like in Issue 342.


This isn't great for our CI process, so has anyone seen anything like this before, or does anyone know how to increase the timeout if the server just takes a while to respond on initial contact? Using spark.network.timeout has no effect.


[jhoole@nid00006 spark]$ kubectl get pods | grep jhoole-spark
jhoole-spark-resource-staging-server-64666675c8-w5cdm   1/1   Running   13m

[jhoole@nid00006 spark]$ kubectl get svc | grep jhoole-spark
jhoole-spark-resource-staging-service   NodePort   10.96.143.55   <none>   10000:30622/TCP   13m

[jhoole@nid00006 spark]$ bin/spark-submit --class org.apache.spark.examples.SparkPi --conf spark.app.name=spark-pi --conf spark.kubernetes.resourceStagingServer.uri=http://192.168.0.1:30622 ./examples/target/scala-2.11/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar

2018-03-27 12:30:13 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-03-27 12:30:13 INFO  UserGroupInformation:966 - Login successful for user jhoole@local using keytab file /security/secrets/jhoole.keytab
2018-03-27 12:30:14 INFO  HadoopStepsOrchestrator:54 - Hadoop Conf directory: /etc/hadoop/conf
2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing view acls to: jhoole
2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing modify acls to: jhoole
2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing view acls groups to: 
2018-03-27 12:30:14 INFO  SecurityManager:54 - Changing modify acls groups to: 
2018-03-27 12:30:14 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jhoole); groups with view permissions: Set(); users with modify permissions: Set(jhoole); groups with modify permissions: Set()
Exception in thread "main" java.net.SocketTimeoutException: timeout
    at okio.Okio$4.newTimeoutException(Okio.java:230)
    at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
    at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345)
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217)
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211)
    at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
    at okhttp3.RealCall.execute(RealCall.java:69)
    at retrofit2.OkHttpCall.execute(OkHttpCall.java:174)
    at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.getTypedResponseResult(SubmittedDependencyUploaderImpl.scala:101)
    at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.doUpload(SubmittedDependencyUploaderImpl.scala:97)
    at org.apache.spark.deploy.k8s.submit.SubmittedDependencyUploaderImpl.uploadJars(SubmittedDependencyUploaderImpl.scala:70)
    at org.apache.spark.deploy.k8s.submit.submitsteps.initcontainer.SubmittedResourcesInitContainerConfigurationStep.configureInitContainer(SubmittedResourcesInitContainerConfigurationStep.scala:48)
    at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep$$anonfun$configureDriver$1.apply(InitContainerBootstrapStep.scala:43)
    at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep$$anonfun$configureDriver$1.apply(InitContainerBootstrapStep.scala:42)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.deploy.k8s.submit.submitsteps.InitContainerBootstrapStep.configureDriver(InitContainerBootstrapStep.scala:42)
    at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(Client.scala:102)
    at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$1.apply(Client.scala:101)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.deploy.k8s.submit.Client.run(Client.scala:101)
    at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$5.apply(Client.scala:200)
    at org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$5.apply(Client.scala:193)
    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2551)
    at org.apache.spark.deploy.k8s.submit.Client$.run(Client.scala:193)
    at org.apache.spark.deploy.k8s.submit.Client$.main(Client.scala:213)
    at org.apache.spark.deploy.k8s.submit.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:786)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.SocketException: Socket closed
    at java.net.SocketInputStream.read(SocketInputStream.java:204)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at okio.Okio$2.read(Okio.java:139)
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
    ... 47 more
2018-03-27 12:30:24 INFO  ShutdownHookManager:54 - Shutdown hook called
2018-03-27 12:30:24 INFO  ShutdownHookManager:54 - Deleting directory /tmp/uploaded-jars-4c7ca1cf-31d6-4dba-9203-c9a6f1cd4099


Thanks,

Jenna


