spark-issues mailing list archives

From "Hyukjin Kwon (Jira)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-24266) Spark client terminates while driver is still running
Date Wed, 09 Oct 2019 10:55:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947550#comment-16947550 ]

Hyukjin Kwon commented on SPARK-24266:
--------------------------------------

In the dev mailing list, it was discussed to resolve JIRAs whose affected versions are EOL releases
and that have had no updates for the last year. Please reopen if you can verify that this issue
persists in a more recent version.

> Spark client terminates while driver is still running
> -----------------------------------------------------
>
>                 Key: SPARK-24266
>                 URL: https://issues.apache.org/jira/browse/SPARK-24266
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.3.0
>            Reporter: Chun Chen
>            Priority: Major
>              Labels: bulk-closed
>
> {code}
> Warning: Ignoring non-spark config property: Default=system properties included when running spark-submit.
> 18/05/11 14:50:12 WARN Config: Error reading service account token from: [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
> 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
> 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
> 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. Mounting Hadoop specific files
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
> 	 pod name: spark-64-293-980-1526021412180-driver
> 	 namespace: tione-603074457
> 	 labels: network -> FLOATINGIP, spark-app-selector -> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
> 	 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
> 	 creation time: 2018-05-11T06:50:17Z
> 	 service account name: default
> 	 volumes: spark-local-dir-0-spark-local, spark-init-properties, download-jars-volume, download-files, spark-init-secret, hadoop-properties, default-token-xvjt9
> 	 node name: N/A
> 	 start time: N/A
> 	 container images: N/A
> 	 phase: Pending
> 	 status: []
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
> 	 pod name: spark-64-293-980-1526021412180-driver
> 	 namespace: tione-603074457
> 	 labels: network -> FLOATINGIP, spark-app-selector -> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
> 	 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
> 	 creation time: 2018-05-11T06:50:17Z
> 	 service account name: default
> 	 volumes: spark-local-dir-0-spark-local, spark-init-properties, download-jars-volume, download-files, spark-init-secret, hadoop-properties, default-token-xvjt9
> 	 node name: tbds-100-98-45-69
> 	 start time: N/A
> 	 container images: N/A
> 	 phase: Pending
> 	 status: []
> 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
> 	 pod name: spark-64-293-980-1526021412180-driver
> 	 namespace: tione-603074457
> 	 labels: network -> FLOATINGIP, spark-app-selector -> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
> 	 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
> 	 creation time: 2018-05-11T06:50:17Z
> 	 service account name: default
> 	 volumes: spark-local-dir-0-spark-local, spark-init-properties, download-jars-volume, download-files, spark-init-secret, hadoop-properties, default-token-xvjt9
> 	 node name: tbds-100-98-45-69
> 	 start time: 2018-05-11T06:50:17Z
> 	 container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
> 	 phase: Pending
> 	 status: [ContainerStatus(containerID=null, image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=null, reason=PodInitializing, additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to finish...
> 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
> 	 pod name: spark-64-293-980-1526021412180-driver
> 	 namespace: tione-603074457
> 	 labels: network -> FLOATINGIP, spark-app-selector -> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
> 	 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
> 	 creation time: 2018-05-11T06:50:17Z
> 	 service account name: default
> 	 volumes: spark-local-dir-0-spark-local, spark-init-properties, download-jars-volume, download-files, spark-init-secret, hadoop-properties, default-token-xvjt9
> 	 node name: tbds-100-98-45-69
> 	 start time: 2018-05-11T06:50:17Z
> 	 container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
> 	 phase: Pending
> 	 status: [ContainerStatus(containerID=null, image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=null, reason=PodInitializing, additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 18/05/11 14:50:27 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
> 	 pod name: spark-64-293-980-1526021412180-driver
> 	 namespace: tione-603074457
> 	 labels: network -> FLOATINGIP, spark-app-selector -> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
> 	 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
> 	 creation time: 2018-05-11T06:50:17Z
> 	 service account name: default
> 	 volumes: spark-local-dir-0-spark-local, spark-init-properties, download-jars-volume, download-files, spark-init-secret, hadoop-properties, default-token-xvjt9
> 	 node name: tbds-100-98-45-69
> 	 start time: 2018-05-11T06:50:17Z
> 	 container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
> 	 phase: Pending
> 	 status: [ContainerStatus(containerID=null, image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=null, reason=PodInitializing, additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 18/05/11 14:50:28 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
> 	 pod name: spark-64-293-980-1526021412180-driver
> 	 namespace: tione-603074457
> 	 labels: network -> FLOATINGIP, spark-app-selector -> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
> 	 pod uid: 90558303-54e7-11e8-9e64-525400da65d8
> 	 creation time: 2018-05-11T06:50:17Z
> 	 service account name: default
> 	 volumes: spark-local-dir-0-spark-local, spark-init-properties, download-jars-volume, download-files, spark-init-secret, hadoop-properties, default-token-xvjt9
> 	 node name: tbds-100-98-45-69
> 	 start time: 2018-05-11T06:50:17Z
> 	 container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
> 	 phase: Running
> 	 status: [ContainerStatus(containerID=docker://e2dbfbe5dd1e9c0e0e3ac045f0e17c5c0c80e684b273ca01daf80d908a1e1368, image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=docker://dfa88515def0ade6874ba13ce40a7073fbdb84eab23b47b3237853e93bde9a02, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=true, restartCount=0, state=ContainerState(running=ContainerStateRunning(startedAt=Time(time=2018-05-11T06:50:27Z, additionalProperties={}), additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})]
> 18/05/11 15:40:28 WARN WatchConnectionManager: Exec Failure
> java.io.EOFException
> 	at okio.RealBufferedSource.require(RealBufferedSource.java:60)
> 	at okio.RealBufferedSource.readByte(RealBufferedSource.java:73)
> 	at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:113)
> 	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:97)
> 	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:262)
> 	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:201)
> 	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
> 	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> 18/05/11 15:40:28 INFO WatchConnectionManager: Current reconnect backoff is 1000 milliseconds (T0)
> 18/05/11 16:30:30 WARN WatchConnectionManager: Exec Failure
> java.io.EOFException
> 	at okio.RealBufferedSource.require(RealBufferedSource.java:60)
> 	at okio.RealBufferedSource.readByte(RealBufferedSource.java:73)
> 	at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:113)
> 	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:97)
> 	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:262)
> 	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:201)
> 	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
> 	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> 18/05/11 16:30:30 INFO WatchConnectionManager: Current reconnect backoff is 1000 milliseconds (T0)
> 18/05/11 16:30:31 INFO LoggingPodStatusWatcherImpl: Container final statuses:
> 	 Container name: spark-kubernetes-driver
> 	 Container image: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
> 	 Container state: Running
> 	 Container started at: 2018-05-11T06:50:27Z
> 18/05/11 16:30:31 INFO Client: Application spark-64-293-980 finished.
> 18/05/11 16:30:31 INFO ShutdownHookManager: Shutdown hook called
> 18/05/11 16:30:31 INFO ShutdownHookManager: Deleting directory /tmp/uploaded-files-ee69f033-7db5-4803-ba35-835f5e3caaeb
> 18/05/11 16:30:31 INFO ShutdownHookManager: Deleting directory /tmp/uploaded-jars-4ab0fbfd-be75-40e3-a3fd-49a7f20b544f
> {code}
> This is the client log from when we hit the issue: the Spark client terminates while the driver is still running. See the last few lines of the log:
> {code}
> 18/05/11 16:30:30 INFO WatchConnectionManager: Current reconnect backoff is 1000 milliseconds (T0)
> 18/05/11 16:30:31 INFO LoggingPodStatusWatcherImpl: Container final statuses:
> 	 Container name: spark-kubernetes-driver
> 	 Container image: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
> 	 Container state: Running
> 	 Container started at: 2018-05-11T06:50:27Z
> {code}
> To debug this, I added some logging to both Spark and kubernetes-client:
> {code}
> # spark patch
> --- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala
> +++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala
> @@ -69,6 +69,7 @@ private[k8s] class LoggingPodStatusWatcherImpl(
>  
>    override def eventReceived(action: Action, pod: Pod): Unit = {
>      this.pod = Option(pod)
> +    logInfo(s"action $action")
>      action match {
>        case Action.DELETED =>
>          podFinalStatus = POD_FAILED
> @@ -92,7 +93,7 @@ private[k8s] class LoggingPodStatusWatcherImpl(
>    }
>  
>    override def onClose(e: KubernetesClientException): Unit = {
> -    logDebug(s"Stopping watching application $appId with last-observed phase $phase")
> +    logInfo(s"Stopping watching application $appId with last-observed phase $phase,
exception $e")
>      closeWatch()
>    }
> # kubernetes-client
> diff --git a/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java b/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java
> index b82b64e6..20587ef0 100644
> --- a/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java
> +++ b/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java
> @@ -129,6 +129,7 @@ public class WatchConnectionManager<T extends HasMetadata, L extends KubernetesR
>          httpUrlBuilder.addQueryParameter("fieldSelector", fieldQueryString);
>        }
>      }
> +    logger.info("resourceVersion " + resourceVersion.get());
>  
>      if (this.resourceVersion.get() != null) {
>        httpUrlBuilder.addQueryParameter("resourceVersion", this.resourceVersion.get());
> @@ -214,12 +215,14 @@ public class WatchConnectionManager<T extends HasMetadata, L extends KubernetesR
>          try {
>            WatchEvent event = readWatchEvent(message);
>            Object object = event.getObject();
> +          logger.info("resourceVersion " + resourceVersion.get());
>            if (object instanceof HasMetadata) {
>              @SuppressWarnings("unchecked")
>              T obj = (T) object;
>              // Dirty cast - should always be valid though
>              String currentResourceVersion = resourceVersion.get();
>              String newResourceVersion = ((HasMetadata) obj).getMetadata().getResourceVersion();
> +            logger.info("currentResourceVersion " + currentResourceVersion + " newResourceVersion
" + newResourceVersion);
>              if (currentResourceVersion == null || currentResourceVersion.compareTo(newResourceVersion)
< 0) {
>                resourceVersion.compareAndSet(currentResourceVersion, newResourceVersion);
>              }
> @@ -244,6 +247,7 @@ public class WatchConnectionManager<T extends HasMetadata, L extends KubernetesR
>              }
>            } else if (object instanceof Status) {
>              Status status = (Status) object;
> +            logger.info("status " + status.toString());
>  
>              // The resource version no longer exists - this has to be handled by the caller.
>              if (status.getCode() == HTTP_GONE) {
> {code}
> This clearly printed the root error shown below. The issue lies in kubernetes-client: it simply propagates the "too old resource version" exception (https://github.com/fabric8io/kubernetes-client/blob/5b1a57b64c7dcc7ebeba3a7024e8615c91afaedb/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java#L259-L266) to the caller, but the Spark client does not handle it (a sketch of one possible handling follows the log below).
> We configured etcd to compact every hour, so any job that runs for more than an hour may hit this issue.
> {code}
> 18/05/13 13:32:48 INFO WatchConnectionManager: Current reconnect backoff is 32000 milliseconds (T5)
> 18/05/13 13:33:20 INFO WatchConnectionManager: resourceVersion 21648111
> 18/05/13 13:33:20 INFO WatchConnectionManager: status Status(apiVersion=v1, code=410, details=null, kind=Status, message=too old resource version: 21648111 (21653211), metadata=ListMeta(resourceVersion=null, selfLink=null, additionalProperties={}), reason=Gone, status=Failure, additionalProperties={})
> 18/05/13 13:33:20 INFO LoggingPodStatusWatcherImpl: Stopping watching application spark-6bb7b853297a4892b9fb95a3a0ac0b3d with last-observed phase Running, exception io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 21648111 (21653211)
> 18/05/13 13:33:20 INFO LoggingPodStatusWatcherImpl: Container final statuses:
> 	 Container name: spark-kubernetes-driver
> 	 Container image: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
> 	 Container state: Running
> 	 Container started at: 2018-05-13T05:21:01Z
> 18/05/13 13:33:20 INFO Client: Application spark-64-293-980 finished.
> 18/05/13 13:33:20 INFO ShutdownHookManager: Shutdown hook called
> 18/05/13 13:33:20 INFO ShutdownHookManager: Deleting directory /tmp/uploaded-jars-9a02d858-73e9-476b-9063-2fcdd9b083a2
> 18/05/13 13:33:20 INFO ShutdownHookManager: Deleting directory /tmp/uploaded-files-f82b5887-6edf-4707-8703-1ccae2114c6f
> {code}
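> As a rough, untested sketch (not part of any actual patch): if LoggingPodStatusWatcherImpl were handed the submission KubernetesClient and the driver pod name (today it is not; both names below are assumed constructor parameters), its onClose could distinguish the 410 case and re-subscribe instead of declaring the application finished:
> {code}
> // Hypothetical sketch only; kubernetesClient and driverPodName do not exist in the
> // current LoggingPodStatusWatcherImpl and are assumed here for illustration.
> import java.net.HttpURLConnection.HTTP_GONE
> import io.fabric8.kubernetes.client.KubernetesClientException
> 
> override def onClose(e: KubernetesClientException): Unit = {
>   if (e != null && e.getCode == HTTP_GONE) {
>     // "too old resource version" means the watch state is stale, not the application.
>     // Re-create the watch from scratch instead of reporting the app as finished.
>     logInfo(s"Watch for application $appId expired (HTTP 410), re-subscribing")
>     kubernetesClient.pods().withName(driverPodName).watch(this)
>   } else {
>     logInfo(s"Stopping watching application $appId with last-observed phase $phase, exception $e")
>     closeWatch()
>   }
> }
> {code}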
> I fixed the issue by hacking kubernetes-client so that it watches without specifying the resource version; a rough sketch of what that amounts to follows. I'm still thinking about a better solution.
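> The effect of that hack, expressed against the fabric8 client DSL (illustrative only; the real change sits inside WatchConnectionManager, and the client, pod name, and watcher below are placeholders):
> {code}
> import io.fabric8.kubernetes.api.model.Pod
> import io.fabric8.kubernetes.client.{DefaultKubernetesClient, KubernetesClientException, Watcher}
> 
> val client = new DefaultKubernetesClient()  // uses the current kube context / in-cluster config
> val driverPodName = "spark-64-293-980-1526021412180-driver"  // driver pod from the log above
> 
> val watcher = new Watcher[Pod] {
>   override def eventReceived(action: Watcher.Action, pod: Pod): Unit =
>     println(s"$action ${pod.getStatus.getPhase}")
>   override def onClose(e: KubernetesClientException): Unit =
>     println(s"watch closed: $e")
> }
> 
> // Watching from a remembered resourceVersion can be rejected with 410 Gone
> // ("too old resource version") once etcd has compacted past it:
> // client.pods().withName(driverPodName).watch(staleResourceVersion, watcher)
> 
> // Watching without a resourceVersion starts from the server's current state,
> // which is what the hack makes the reconnect do:
> val watch = client.pods().withName(driverPodName).watch(watcher)
> {code}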



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

