From Mich Talebzadeh
Subject Performance of PySpark jobs on the Kubernetes cluster
Date Mon, 09 Aug 2021 19:20:03 GMT

I have a basic question to ask.

I am running a Google Kubernetes Engine (GKE) cluster with three nodes,
each with the configuration below:

e2-standard-2 (2 vCPUs, 8 GB memory)

spark-submit is launched from another node (actually a single-node Dataproc
machine that I have just upgraded to e2-custom: 4 vCPUs, 8 GB memory). We
call this the launch node.

OK, I know that the cluster is not much, but Google was complaining about
the launch node hitting 100% CPU, so I added two more vCPUs to it.

It appears that despite using k8s as the computational cluster, the burden
falls upon the launch node!

The CPU utilisation for the launch node is shown below:

[image: image.png]
The dip is where the two extra vCPUs were added and the node had to reboot.
Usage is around 70%.

The combined CPU usage for the GKE nodes is shown below:

[image: image.png]

Never goes above 20%!

I can see the driver and executors as below:

k get pods -n spark
NAME                                         READY   STATUS    RESTARTS
pytest-c958c97b2c52b6ed-driver               1/1     Running   0
randomdatabigquery-e68a8a7b2c52f468-exec-1   1/1     Running   0
randomdatabigquery-e68a8a7b2c52f468-exec-2   1/1     Running   0
randomdatabigquery-e68a8a7b2c52f468-exec-3   0/1     Pending   0

It is a PySpark 3.1.1 image using Java 8, pushing randomly generated data
into the Google BigQuery data warehouse. The last executor (exec-3) seems to
be stuck in Pending. The spark-submit is as below:

        spark-submit --verbose \
           --properties-file ${property_file} \
           --master k8s://https://$KUBERNETES_MASTER_IP:443 \
           --deploy-mode cluster \
           --name pytest \
           --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./pyspark_venv/bin/python \
           --py-files $CODE_DIRECTORY/ \
           --conf spark.kubernetes.namespace=$NAMESPACE \
           --conf spark.executor.memory=5000m \
           --conf spark.executor.instances=3 \
           --conf spark.kubernetes.driver.limit.cores=1 \
           --conf spark.driver.cores=1 \
           --conf spark.executor.cores=1 \
           --conf spark.executor.memory=2000m \
           --conf spark.kubernetes.driver.docker.image=${IMAGEGCP} \
           --conf spark.kubernetes.executor.docker.image=${IMAGEGCP} \
           --conf spark.kubernetes.container.image=${IMAGEGCP} \
           --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
           --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
           --conf spark.sql.execution.arrow.pyspark.enabled="true" \
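
As a rough sanity check on the pending executor, the CPU requests implied by
the settings above can be compared with what three 2-vCPU nodes can hold. The
sketch below is only an estimate. The per-node reserve (system_reserved_m) is
an assumed figure, not something measured on this cluster, and the variable
names are made up for illustration. It relies on the driver and executor pods
requesting spark.driver.cores and spark.executor.cores respectively when no
explicit request is configured.

```shell
# Rough CPU capacity check for the spark-submit settings above.
# Lines marked ASSUMPTION are not from the post; replace them with the
# "Allocatable" figures reported by `kubectl describe node`.

nodes=3                   # GKE nodes (from the post)
node_cpu_m=2000           # 2 vCPUs per e2-standard-2, in millicores
system_reserved_m=500     # ASSUMPTION: CPU per node taken by kubelet/system pods

driver_cpu_m=1000         # spark.driver.cores=1
executor_cpu_m=1000       # spark.executor.cores=1
executors=3               # spark.executor.instances=3

# Whole 1-core Spark pods that fit on one node after the reserved share.
pods_per_node=$(( (node_cpu_m - system_reserved_m) / executor_cpu_m ))

# Pods the cluster can schedule versus pods Spark asks for (driver + executors).
schedulable=$(( pods_per_node * nodes ))
wanted=$(( driver_cpu_m / 1000 + executors ))

echo "pods that fit: ${schedulable}, pods requested: ${wanted}"
```

Under these assumed figures only three 1-core pods fit while four are
requested, which would be consistent with the driver and two executors running
and exec-3 staying in Pending. Whether that is what is actually happening can
be checked from the scheduler events shown by
`kubectl describe pod randomdatabigquery-e68a8a7b2c52f468-exec-3 -n spark`.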

Aren't the driver and executors running on the k8s cluster? So why is the
launch node heavily used while the k8s cluster is underutilized?


