spark-user mailing list archives

From Philipp Kraus <philipp.kraus.flashp...@gmail.com>
Subject Spark Executor dies in K8 cluster
Date Wed, 19 May 2021 09:18:18 GMT
Hello,

I have the following initial test setup:



Kubernetes Cluster 1.20 (4 nodes, each node with 120 GB hard disk, 4 cpus, 40 GB memory)

Spark installed with the Bitnami Helm chart https://artifacthub.io/packages/helm/bitnami/spark
(Chart Version 5.4.2 / Spark 3.1.1)

GeoSpark version 1.3.2-SNAPSHOT (not Apache Sedona, because of migration issues), configured
as described at https://sedona.apache.org/download/cluster/, i.e. spark.driver.memory 10g,
spark.network.timeout 1000s, spark.driver.maxResultSize 5g

A fat-JAR Spring Boot application that runs some Spark algorithms on AdoptOpenJDK 1.8
(latest Docker image) with Spark 3.1.1 and Scala 2.12

NFS Server Provisioner, installed as a Helm chart (https://artifacthub.io/packages/helm/kvaps/nfs-server-provisioner),
provides a ReadWriteMany volume for the Spark workers and the application. This volume is
mounted at /sparkdir on all pods, and the fat JAR file is stored there

Spark workers are configured with Helm as a ReplicaSet, so that a new worker should be
spawned at 75% CPU usage; by default, 2 worker pods are running

The Spark master UI shows the workers with the correct memory and cpu resources (4 cores and
10 GB memory for each worker)

Application and Spark are running in the same namespace




In the Spring Boot application (packaged as a Docker image) I create a Spark config as
follows (the Helm release name is "test"):

final String l_jar = "/sparkdir/myspringapp.jar";
new SparkConf().setMaster( "spark://test-spark-master-svc:7077" )
               .setAppName( "mySpringBootApp" )
               .setJars( Stream.of( l_jar ).toArray( String[]::new ) )
               .set( "spark.jars", l_jar )
               .set( "spark.driver.userClassPathFirst", l_jar )
               .set( "spark.kubernetes.container.image", "bitnami/spark:3.1.1" )
               .set( "spark.submit.deployMode", "cluster" )
               .set( "spark.driver.memory", "10G" )
               .set( "spark.executor.memory", "4G" )
               .set( "spark.network.timeout", "1000s" )
               .set( "spark.driver.maxResultSize", "5G" );

When I start the application and run the Spark job, the master receives the job and passes
it to the workers. That part works fine, but the executors fail on the workers. Here is the
log of one worker:

This script is deprecated, use start-worker.sh
starting org.apache.spark.deploy.worker.Worker, logging to /opt/bitnami/spark/logs/spark--org.apache.spark.deploy.worker.Worker-1-test-spark-worker-0.out
Spark Command: /opt/bitnami/java/bin/java -cp /opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://test-spark-master-svc:7077
========================================

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/05/18 18:56:11 INFO Worker: Started daemon with process name: 41@test-spark-worker-0
21/05/18 18:56:11 INFO SignalUtils: Registering signal handler for TERM
21/05/18 18:56:11 INFO SignalUtils: Registering signal handler for HUP
21/05/18 18:56:11 INFO SignalUtils: Registering signal handler for INT
21/05/18 18:56:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/05/18 18:56:11 INFO SecurityManager: Changing view acls to: spark
21/05/18 18:56:11 INFO SecurityManager: Changing modify acls to: spark
21/05/18 18:56:11 INFO SecurityManager: Changing view acls groups to: 
21/05/18 18:56:11 INFO SecurityManager: Changing modify acls groups to: 
21/05/18 18:56:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(spark); groups with view permissions: Set(); users  with modify permissions: Set(spark); groups with modify permissions: Set()
21/05/18 18:56:11 INFO Utils: Successfully started service 'sparkWorker' on port 35561.
21/05/18 18:56:11 INFO Worker: Worker decommissioning not enabled, SIGPWR will result in exiting.
21/05/18 18:56:12 INFO Worker: Starting Spark worker 10.223.130.87:35561 with 4 cores, 10.0 GiB RAM
21/05/18 18:56:12 INFO Worker: Running Spark version 3.1.1
21/05/18 18:56:12 INFO Worker: Spark home: /opt/bitnami/spark
21/05/18 18:56:12 INFO ResourceUtils: ==============================================================
21/05/18 18:56:12 INFO ResourceUtils: No custom resources configured for spark.worker.
21/05/18 18:56:12 INFO ResourceUtils: ==============================================================
21/05/18 18:56:12 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
21/05/18 18:56:12 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://test-spark-worker-0.test-spark-headless.workflow.svc.cluster.local:8081
21/05/18 18:56:12 INFO Worker: Connecting to master test-spark-master-svc:7077...
21/05/18 18:56:12 INFO TransportClientFactory: Successfully created connection to test-spark-master-svc/10.233.8.202:7077 after 31 ms (0 ms spent in bootstraps)
21/05/18 18:56:12 INFO Worker: Successfully registered with master spark://test-spark-master-0.test-spark-headless.workflow.svc.cluster.local:7077



---------- the following lines repeat on all workers in an endless loop until I kill the
application on the Spark master ----------

21/05/18 20:46:55 INFO Worker: Asked to launch executor app-20210518204655-0000/1 for f212b4b4-05df-4f22-a580-87cbe5fb9356
21/05/18 20:46:55 INFO SecurityManager: Changing view acls to: spark
21/05/18 20:46:55 INFO SecurityManager: Changing modify acls to: spark
21/05/18 20:46:55 INFO SecurityManager: Changing view acls groups to: 
21/05/18 20:46:55 INFO SecurityManager: Changing modify acls groups to: 
21/05/18 20:46:55 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(spark); groups with view permissions: Set(); users  with modify permissions: Set(spark); groups with modify permissions: Set()

21/05/18 20:46:55 INFO ExecutorRunner: Launch command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx4096M" "-Dspark.network.timeout=1000s" "-Dspark.driver.port=41904" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@test-workflowengine-567454667d-6c7p7:41904" "--executor-id" "1" "--hostname" "10.223.130.87" "--cores" "4" "--app-id" "app-20210518204655-0000" "--worker-url" "spark://Worker@10.223.130.87:35561"

21/05/18 20:46:56 INFO Worker: Executor app-20210518204655-0000/1 finished with state EXITED message Command exited with code 1 exitStatus 1
21/05/18 20:46:56 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 1
21/05/18 20:46:56 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20210518204655-0000, execId=1)
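
The launch command in the log tells each executor to connect back to the driver at the
address given by --driver-url. Whether that host name is resolvable from inside a worker
pod can be checked directly. The following is a minimal diagnostic sketch, assuming it is
compiled and run inside a worker pod; the class name DriverUrlCheck and its methods are
hypothetical, and a failed lookup would be a hint to investigate, not a confirmed cause.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;

public class DriverUrlCheck {

    // Parses a driver URL of the form spark://CoarseGrainedScheduler@host:port
    // and rejects it if host or port cannot be extracted.
    static URI parseDriverUrl(String driverUrl) {
        URI uri = URI.create(driverUrl);
        if (uri.getHost() == null || uri.getPort() < 0) {
            throw new IllegalArgumentException("No host/port in " + driverUrl);
        }
        return uri;
    }

    // Returns true if the host name can be resolved from the machine
    // (here: the pod) on which this code runs.
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default value copied from the "--driver-url" argument in the log above.
        String url = args.length > 0
            ? args[0]
            : "spark://CoarseGrainedScheduler@test-workflowengine-567454667d-6c7p7:41904";
        URI uri = parseDriverUrl(url);
        System.out.println("driver host: " + uri.getHost() + ", port: " + uri.getPort());
        System.out.println("resolvable from here: " + resolves(uri.getHost()));
    }
}
```

The check only covers name resolution. Whether the driver port is reachable from the
worker pod would need a separate connection test.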




I have set up the same structure with Spark in docker-compose, using the same configuration
values (I also use cluster mode there), and it works. On the Kubernetes setup, however, the
executor fails, and I don't know how to find out what goes wrong or how to fix it.
I would appreciate some help in getting more information about the failure and about what
I can do to fix it. I don't know whether this is an error in my Kubernetes configuration, in
the application code for Spark initialization, or in my worker / Spark configuration.
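
On finding more information: the worker log only reports the exit code, while the output
of the executor process itself is written by a standalone worker to its work directory, in
the files stdout and stderr under <work dir>/<app id>/<executor id>/. The following is a
minimal sketch for printing the end of that file, assuming the work directory is the default
SPARK_HOME/work (that is, SPARK_WORKER_DIR is not overridden) and that the code runs inside
the worker pod. The class name ExecutorLogTail is hypothetical; the Spark home, application
ID and executor ID are taken from the log above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class ExecutorLogTail {

    // Builds <workDir>/<appId>/<executorId>/stderr, the location where a
    // standalone worker stores the stderr output of an executor.
    static Path stderrPath(String workDir, String appId, String executorId) {
        return Paths.get(workDir, appId, executorId, "stderr");
    }

    // Returns the last n lines of the file (all lines if the file is shorter).
    static List<String> tail(Path file, int n) throws IOException {
        List<String> lines = Files.readAllLines(file);
        return lines.subList(Math.max(0, lines.size() - n), lines.size());
    }

    public static void main(String[] args) throws IOException {
        // Assumption: default work directory below the Spark home from the log.
        Path p = stderrPath("/opt/bitnami/spark/work", "app-20210518204655-0000", "1");
        if (!Files.exists(p)) {
            System.out.println("not found: " + p);
            return;
        }
        tail(p, 50).forEach(System.out::println);
    }
}
```

The same files can also be read with any other tool available in the pod; the point is
only where to look.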

Thanks for your help

Phil