spark-issues mailing list archives

From "Andy Grove (Jira)" <j...@apache.org>
Subject [jira] [Issue Comment Deleted] (SPARK-29640) [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver
Date Thu, 12 Dec 2019 10:42:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-29640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove updated SPARK-29640:
-------------------------------
    Comment: was deleted

(was: We finally identified the root cause of this issue, so I'm documenting it here in the
hope that it helps someone else in the future.

The issue was due to the way that routing was set up on our EKS clusters combined with the
fact that we were using an NLB rather than ELB along with nginx ingress controllers.

Specifically, NLB does not support "hairpinning", as explained in [https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-troubleshooting.html].

In layman's terms, if pod A tries to communicate with pod B, both pods are on the same
node, and the request egresses from the node and is then routed back to the same node via
the NLB and nginx controller, the request can never succeed and will time out.

Switching to an ELB resolves the issue, but a better solution is to use cluster-local addressing
so that communication between pods on the same node stays on the local network.)
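
As a rough illustration of the cluster-local addressing mentioned above, the sketch below builds the in-cluster DNS name of a Kubernetes Service, so a client can reach it without leaving the cluster through a load balancer. It assumes the default cluster domain `cluster.local`; the class name and the service name `my-service` are hypothetical and are not taken from the setup described in the comment.

```java
// Sketch: build the in-cluster DNS name of a Kubernetes Service.
// Assumes the default cluster domain "cluster.local"; the class and the
// example service name are hypothetical and do not come from the issue.
final class ClusterLocalAddress {
    private static final String DEFAULT_CLUSTER_DOMAIN = "cluster.local";

    private ClusterLocalAddress() {}

    /** Returns <service>.<namespace>.svc.<clusterDomain>. */
    static String serviceHost(String service, String namespace, String clusterDomain) {
        if (service.isEmpty() || namespace.isEmpty() || clusterDomain.isEmpty()) {
            throw new IllegalArgumentException(
                    "service, namespace and clusterDomain must be non-empty");
        }
        return service + "." + namespace + ".svc." + clusterDomain;
    }

    /** Same as above, using the assumed default cluster domain. */
    static String serviceHost(String service, String namespace) {
        return serviceHost(service, namespace, DEFAULT_CLUSTER_DOMAIN);
    }

    public static void main(String[] args) {
        // Hypothetical service in the namespace that appears in the stack trace.
        System.out.println(serviceHost("my-service", "tenant-8-workflows"));
    }
}
```

Clusters configured with a different DNS domain would need to pass that domain explicitly through the three-argument overload.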

> [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29640
>                 URL: https://issues.apache.org/jira/browse/SPARK-29640
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.4.4
>            Reporter: Andy Grove
>            Priority: Major
>
> We are running into intermittent DNS issues where the Spark driver fails to resolve "kubernetes.default.svc" when trying to create executors. We are running Spark 2.4.4 (with the patch for SPARK-28921) in cluster mode in EKS.
> This happens approximately 10% of the time.
> Here is the stack trace:
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: External scheduler cannot be instantiated
> 	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2794)
> 	at org.apache.spark.SparkContext.<init>(SparkContext.scala:493)
> 	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
> 	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
> 	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
> 	at scala.Option.getOrElse(Option.scala:121)
> 	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
> 	at com.rms.execution.test.SparkPiTask$.main(SparkPiTask.scala:36)
> 	at com.rms.execution.test.SparkPiTask.main(SparkPiTask.scala)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> 	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
> 	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
> 	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
> 	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
> 	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
> 	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
> 	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Pod]  with name: [wf-50000-69674f15d0fc45-1571354060179-driver]  in namespace: [tenant-8-workflows]  failed.
> 	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
> 	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
> 	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:229)
> 	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:162)
> 	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:57)
> 	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:55)
> 	at scala.Option.map(Option.scala:146)
> 	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:55)
> 	at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:89)
> 	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2788)
> 	... 20 more
> Caused by: java.net.UnknownHostException: kubernetes.default.svc: Try again
> 	at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
> 	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
> 	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
> 	at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
> 	at java.net.InetAddress.getAllByName(InetAddress.java:1193)
> 	at java.net.InetAddress.getAllByName(InetAddress.java:1127)
> 	at okhttp3.Dns$1.lookup(Dns.java:39)
> 	at okhttp3.internal.connection.RouteSelector.resetNextInetSocketAddress(RouteSelector.java:171)
> 	at okhttp3.internal.connection.RouteSelector.nextProxy(RouteSelector.java:137)
> 	at okhttp3.internal.connection.RouteSelector.next(RouteSelector.java:82)
> 	at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:171)
> 	at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
> 	at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
> 	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:110)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
> 	at okhttp3.RealCall.execute(RealCall.java:69)
> 	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:404)
> 	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:365)
> 	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:330)
> 	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:311)
> 	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:810)
> 	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:218)
> 	... 27 more
> {code}
> This issue seems to be caused by [https://github.com/kubernetes/kubernetes/issues/76790].
> One suggested workaround is to specify TCP mode for DNS lookups in the pod spec ([https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-424498508]).
> I would like to be able to pass a flag to spark-submit that enables TCP mode for DNS lookups.
> I am working on a PR for this.
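
For readers who hit the same transient `UnknownHostException: ... Try again`, the sketch below shows a different, application-level mitigation: retrying the lookup with exponential backoff. It is not the TCP-mode workaround described above and is not part of Spark or the fabric8 client. The `DnsRetry` class, the `Resolver` interface, and the attempt and backoff values are all hypothetical.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Sketch: retry a hostname lookup with exponential backoff.
// DnsRetry and Resolver are hypothetical names introduced for illustration.
final class DnsRetry {
    /** Lookup abstraction so the retry logic can be exercised without a network. */
    interface Resolver {
        InetAddress[] lookup(String host) throws UnknownHostException;
    }

    private DnsRetry() {}

    static InetAddress[] resolveWithRetry(
            Resolver resolver, String host, int maxAttempts, long initialBackoffMillis)
            throws UnknownHostException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoff = initialBackoffMillis;
        UnknownHostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return resolver.lookup(host);
            } catch (UnknownHostException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoff); // wait before the next attempt
                    backoff *= 2;          // double the wait each time
                }
            }
        }
        throw last; // every attempt failed; surface the last error
    }

    /** Convenience overload backed by the JVM resolver; values are assumptions. */
    static InetAddress[] resolveWithRetry(String host)
            throws UnknownHostException, InterruptedException {
        return resolveWithRetry(InetAddress::getAllByName, host, 5, 1000L);
    }
}
```

Note that the JVM caches failed lookups for a short period (the `networkaddress.cache.negative.ttl` security property, 10 seconds by default), so retries issued inside that window may return the cached failure without querying DNS again.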



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

