spark-issues mailing list archives

From "Andy Grove (Jira)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-29640) [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver
Date Wed, 30 Oct 2019 16:04:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-29640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963175#comment-16963175 ]

Andy Grove commented on SPARK-29640:
------------------------------------

A hacky workaround is to wait for DNS to resolve before creating the Spark context:
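{code:java}
import java.net.{InetAddress, UnknownHostException}

// Block until the Kubernetes API server hostname resolves, or give up after 15 seconds.
def waitForDns(): Unit = {
  val host = "kubernetes.default.svc"
  val timeoutMs = 15000L

  println(s"Resolving $host ...")
  val t1 = System.currentTimeMillis()
  var attempts = 0
  while (System.currentTimeMillis() - t1 < timeoutMs) {
    try {
      attempts += 1
      val address = InetAddress.getByName(host)
      println(s"Resolved $host as ${address.getHostAddress()} after $attempts attempt(s)")
      return
    } catch {
      case _: UnknownHostException =>
        println(s"Failed to resolve $host due to UnknownHostException (attempt $attempts)")
        Thread.sleep(100) // brief pause before retrying
    }
  }
  // Timed out: fall through and let Spark context creation surface the error.
  println(s"Giving up on resolving $host after $attempts attempt(s)")
}
{code}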
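
A slightly more reusable variant of the workaround above, as a minimal sketch rather than a tested fix. The `DnsWait` object, its parameter names and the injectable `resolve` function are my own naming and are not part of Spark or the JDK. It returns the resolved address as an `Option` so the caller can decide what to do on timeout.

```scala
import java.net.{InetAddress, UnknownHostException}

object DnsWait {
  /**
   * Retries `resolve(host)` until it succeeds or `timeoutMs` elapses.
   * Returns Some(address) on success and None on timeout.
   * `resolve` defaults to the JDK lookup; it is a parameter only so the
   * retry logic can be exercised without a real DNS server.
   */
  def waitForDns(
      host: String,
      timeoutMs: Long = 15000L,
      sleepMs: Long = 100L,
      resolve: String => InetAddress = (h: String) => InetAddress.getByName(h)
  ): Option[InetAddress] = {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (System.currentTimeMillis() < deadline) {
      try {
        return Some(resolve(host))
      } catch {
        // Only DNS failures are retried; any other exception propagates.
        case _: UnknownHostException => Thread.sleep(sleepMs)
      }
    }
    None
  }
}
```

The intended use is to call `DnsWait.waitForDns("kubernetes.default.svc")` before building the Spark session and fail fast if it returns `None`. One caveat, as far as I know: the JVM caches failed lookups (security property `networkaddress.cache.negative.ttl`, 10 seconds by default), so rapid retries may be answered from that cache. Calling `java.security.Security.setProperty("networkaddress.cache.negative.ttl", "0")` before the first lookup should avoid that.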

> [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29640
>                 URL: https://issues.apache.org/jira/browse/SPARK-29640
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.4.4
>            Reporter: Andy Grove
>            Priority: Major
>             Fix For: 2.4.5
>
>
> We are running into intermittent DNS issues where the Spark driver fails to resolve "kubernetes.default.svc" when trying to create executors. We are running Spark 2.4.4 (with the patch for SPARK-28921) in cluster mode in EKS.
> This happens approximately 10% of the time.
> Here is the stack trace:
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: External scheduler cannot be instantiated
> 	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2794)
> 	at org.apache.spark.SparkContext.<init>(SparkContext.scala:493)
> 	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
> 	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
> 	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
> 	at scala.Option.getOrElse(Option.scala:121)
> 	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
> 	at com.rms.execution.test.SparkPiTask$.main(SparkPiTask.scala:36)
> 	at com.rms.execution.test.SparkPiTask.main(SparkPiTask.scala)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> 	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
> 	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
> 	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
> 	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
> 	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
> 	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
> 	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Pod]  with name: [wf-50000-69674f15d0fc45-1571354060179-driver]  in namespace: [tenant-8-workflows]  failed.
> 	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
> 	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
> 	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:229)
> 	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:162)
> 	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:57)
> 	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:55)
> 	at scala.Option.map(Option.scala:146)
> 	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:55)
> 	at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:89)
> 	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2788)
> 	... 20 more
> Caused by: java.net.UnknownHostException: kubernetes.default.svc: Try again
> 	at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
> 	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
> 	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
> 	at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
> 	at java.net.InetAddress.getAllByName(InetAddress.java:1193)
> 	at java.net.InetAddress.getAllByName(InetAddress.java:1127)
> 	at okhttp3.Dns$1.lookup(Dns.java:39)
> 	at okhttp3.internal.connection.RouteSelector.resetNextInetSocketAddress(RouteSelector.java:171)
> 	at okhttp3.internal.connection.RouteSelector.nextProxy(RouteSelector.java:137)
> 	at okhttp3.internal.connection.RouteSelector.next(RouteSelector.java:82)
> 	at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:171)
> 	at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
> 	at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
> 	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:110)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
> 	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
> 	at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
> 	at okhttp3.RealCall.execute(RealCall.java:69)
> 	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:404)
> 	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:365)
> 	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:330)
> 	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:311)
> 	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:810)
> 	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:218)
> 	... 27 more  {code}
> This issue seems to be caused by [https://github.com/kubernetes/kubernetes/issues/76790]
> One suggested workaround is to specify TCP mode for DNS lookups in the pod spec ([https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-424498508]).
> I would like to be able to pass a flag to spark-submit that selects TCP mode for DNS lookups.
> I am working on a PR for this.
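
For reference, the TCP-mode workaround mentioned in the description might look roughly like this in a pod spec. This is a sketch under assumptions: the pod name, container name and image are hypothetical placeholders, and `use-vc` is the glibc resolver option for forcing TCP, which other resolvers (for example musl in Alpine-based images) may ignore. How such a setting would reach the Spark driver pod is not shown here; Spark 2.4.4 offers no option for it, which is what the requested flag is meant to address.

```yaml
# Pod spec fragment (sketch): ask the resolver to use TCP for DNS lookups.
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver-example          # hypothetical name
spec:
  dnsConfig:
    options:
      - name: use-vc                  # glibc resolv.conf option: force TCP for DNS queries
  containers:
    - name: spark-kubernetes-driver   # hypothetical container name
      image: example/spark:2.4.4      # hypothetical image
```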
