spark-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Spark with Kubernetes connecting to pod ID, not address
Date Thu, 14 Feb 2019 01:22:12 GMT
Hmm, I’m not asking about using k8s to control Spark as a job manager or scheduler like YARN.
We use the built-in standalone Spark job manager with spark://spark-api:7077 as the master,
not k8s.

The problem is using k8s to manage a cluster consisting of our app, some databases, and Spark
(one master, one driver, several executors). Some kind of callback from Spark is trying to
use the pod ID as its target address and is failing to connect because of that. We have tried
deployMode “client” and “cluster” but get the same error.

The full trace is below but the important bit is:

    Failed to connect to harness-64d97d6d6-6n7nh:46337

This came from deployMode = “client”, and the port is the driver port, which should be
on the launching pod. For some reason it is using a pod ID instead of a real address. Doesn’t
the driver run in the launching app’s process? The launching app is on the pod with ID harness-64d97d6d6-6n7nh
but it has the k8s DNS address of harness-api. I can see the correct address for the launching
pod with "kubectl get services".
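
One possible workaround, offered as a sketch rather than a confirmed fix: in client mode the driver advertises its own hostname unless `spark.driver.host` is set, so the pod name can end up in the driver URL. The snippet below collects driver-side Spark properties that would advertise the k8s service name instead. It assumes the `harness-api` service also exposes the driver ports to the executors; the port numbers and the helper name `driver_network_conf` are hypothetical.

```python
def driver_network_conf(advertised_host, driver_port, block_manager_port):
    """Return Spark properties that make the driver advertise a resolvable address."""
    return {
        # Address the executors and the standalone master use to reach the driver.
        "spark.driver.host": advertised_host,
        # Bind on all interfaces inside the pod, since the advertised name
        # is a service name and not a local interface.
        "spark.driver.bindAddress": "0.0.0.0",
        # Fixed ports (Spark picks random ones by default) so a service can expose them.
        "spark.driver.port": str(driver_port),
        "spark.driver.blockManager.port": str(block_manager_port),
    }


# Hypothetical values: only the name harness-api comes from the message above.
conf = driver_network_conf("harness-api", 7078, 7079)
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

These pairs could be applied to the SparkConf the app builds before it creates the context. Whether a plain service is enough, or a headless one is needed, depends on how the cluster is set up.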


The error is:

Spark Executor Command: "/usr/lib/jvm/java-1.8-openjdk/bin/java" "-cp" "/spark/conf/:/spark/jars/*:/etc/hadoop/"
"-Xmx1024M" "-Dspark.driver.port=46337" "org.apache.spark.executor.CoarseGrainedExecutorBackend"
"--driver-url" "spark://CoarseGrainedScheduler@harness-64d97d6d6-6n7nh:46337" "--executor-id"
"138" "--hostname" "10.31.31.174" "--cores" "8" "--app-id" "app-20190213210105-0000" "--worker-url"
"spark://Worker@10.31.31.174:37609"
========================================

Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:63)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:293)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: 
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:63)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
	... 4 more
Caused by: java.io.IOException: Failed to connect to harness-64d97d6d6-6n7nh:46337
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
	at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
	at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
	at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: harness-64d97d6d6-6n7nh
	at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
	at java.net.InetAddress.getAllByName(InetAddress.java:1193)
	at java.net.InetAddress.getAllByName(InetAddress.java:1127)
	at java.net.InetAddress.getByName(InetAddress.java:1077)
	at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
	at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
	at java.security.AccessController.doPrivileged(Native Method)
	at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
	at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
	at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
	at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
	at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
	at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
	at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
	at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
	at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
	at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
	at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
	at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
	at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:978)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:512)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:423)
	at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:482)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
	... 1 more
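
A small diagnostic sketch for a trace like this one: it pulls the host and port out of the `--driver-url` value in the executor command and checks whether that name resolves from wherever the script runs (for example, inside an executor pod). The function names are hypothetical; only the URL string is taken from the log.

```python
import socket
from urllib.parse import urlsplit


def driver_endpoint(driver_url):
    """Split a spark:// driver URL into (host, port)."""
    parts = urlsplit(driver_url)
    return parts.hostname, parts.port


def resolves(host):
    """Return True if the local resolver can turn host into an address."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised for unknown hosts.
        return False


host, port = driver_endpoint(
    "spark://CoarseGrainedScheduler@harness-64d97d6d6-6n7nh:46337"
)
print(host, port, "resolvable" if resolves(host) else "not resolvable")
```

If the host printed here is a pod name rather than a service name, the executors are being handed an address that cluster DNS does not serve by default.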



From: Erik Erlandson <eerlands@redhat.com>
Date: February 13, 2019 at 4:57:30 AM
To: Pat Ferrel <pat@actionml.com>
Subject:  Re: Spark with Kubernetes connecting to pod id, not address  

Hi Pat,

I'd suggest visiting the big data Slack channel; it's a more Spark-oriented forum than kube-dev:
https://kubernetes.slack.com/messages/C0ELB338T/

Tentatively, I think you may want to submit in client mode (unless you are initiating your
application from outside the kube cluster). When in client mode, you need to set up a headless
service for the application driver pod that the executors can use to talk back to the driver.
https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode
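
As a rough illustration of the headless service mentioned above, the sketch below builds a Service manifest with `clusterIP: None` and prints it as JSON, which `kubectl apply -f` accepts. The service name `harness-driver`, the `app: harness` selector label, and the port numbers are all assumptions made for the example, not values from this thread.

```python
import json


def headless_service(name, selector, ports):
    """Build a headless Service manifest (clusterIP "None") as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            # "None" makes the service headless: DNS returns the pod IPs directly.
            "clusterIP": "None",
            # Must match the labels on the pod that runs the driver.
            "selector": selector,
            "ports": [
                {"name": port_name, "port": number, "targetPort": number}
                for port_name, number in ports.items()
            ],
        },
    }


# Hypothetical name, label, and ports for the driver pod.
manifest = headless_service(
    "harness-driver",
    {"app": "harness"},
    {"driver": 7078, "blockmanager": 7079},
)
print(json.dumps(manifest, indent=2))
```

The driver would then need to advertise the service name and listen on the same fixed ports for the executors to reach it.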

Cheers,
Erik


On Wed, Feb 13, 2019 at 1:55 AM Pat Ferrel <pat@actionml.com> wrote:
We have a k8s deployment of several services including Apache Spark. All services seem to
be operational. Our application connects to the Spark master to submit a job using the k8s
DNS service for the cluster, where the master is called spark-api, so we use master=spark://spark-api:7077
and spark.submit.deployMode=cluster. We submit the job through the API, not by the spark-submit
script.

This will run the "driver" and all "executors" on the cluster and this part seems to work
but there is a callback to the launching code in our app from some Spark process. For some
reason it is trying to connect to harness-64d97d6d6-4r4d8, which is the pod ID, not the k8s
cluster IP or DNS.

How could this pod ID be getting into the system? Spark somehow seems to think it is the address
of the service that called it. Needless to say any connection to the k8s pod ID fails and
so does the job.

Any idea how Spark could think the pod ID is an IP address or DNS name? 

BTW if we run a small sample job with `master=local` all is well, but the same job executed
with the above config tries to connect to the spurious pod ID.