spark-user mailing list archives

From Khalid Mammadov <khalidmammad...@gmail.com>
Subject Re: Performance of PySpark jobs on the Kubernetes cluster
Date Tue, 10 Aug 2021 10:19:10 GMT
Hi Mich

I think you need to check your code.
If the code does not use the PySpark API effectively, you may see this
behaviour, i.e. if you use the pure Python/pandas API rather than PySpark's
transform -> transform -> action pattern, e.g.
df.select(...).withColumn(...)...count()

Hope this helps to point you in the right direction.

Cheers
Khalid




On Mon, 9 Aug 2021, 20:20 Mich Talebzadeh, <mich.talebzadeh@gmail.com>
wrote:

> Hi,
>
> I have a basic question to ask.
>
> I am running a Google k8s cluster (AKA GKE) with three nodes, each with the
> configuration below:
>
> e2-standard-2 (2 vCPUs, 8 GB memory)
>
> spark-submit is launched from another node (actually a Dataproc single
> node that I have just upgraded to e2-custom (4 vCPUs, 8 GB mem)). We call
> this the launch node.
>
> OK, I know that the cluster is not much, but Google was complaining about
> the launch node hitting 100% CPU, so I added two more CPUs to it.
>
> It appears that despite using k8s as the computational cluster, the burden
> falls upon the launch node!
>
> The CPU utilisation for the launch node is shown below:
>
> [image: image.png]
> The dip is when two more CPUs were added to it, so it had to reboot. So
> usage is around 70%.
>
> The combined cpu usage for GKE nodes is shown below:
>
> [image: image.png]
>
> Never goes above 20%!
>
> I can see the driver and executors as below:
>
> k get pods -n spark
> NAME                                         READY   STATUS    RESTARTS   AGE
> pytest-c958c97b2c52b6ed-driver               1/1     Running   0          69s
> randomdatabigquery-e68a8a7b2c52f468-exec-1   1/1     Running   0          51s
> randomdatabigquery-e68a8a7b2c52f468-exec-2   1/1     Running   0          51s
> randomdatabigquery-e68a8a7b2c52f468-exec-3   0/1     Pending   0          51s
>
> It is a PySpark 3.1.1 image using Java 8, pushing randomly generated data
> into the Google BigQuery data warehouse. The last executor (exec-3) seems to
> be just pending. The spark-submit is as below:
>
>         spark-submit --verbose \
>            --properties-file ${property_file} \
>            --master k8s://https://$KUBERNETES_MASTER_IP:443 \
>            --deploy-mode cluster \
>            --name pytest \
>            --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./pyspark_venv/bin/python \
>            --py-files $CODE_DIRECTORY/DSBQ.zip \
>            --conf spark.kubernetes.namespace=$NAMESPACE \
>            --conf spark.executor.memory=5000m \
>            --conf spark.network.timeout=300 \
>            --conf spark.executor.instances=3 \
>            --conf spark.kubernetes.driver.limit.cores=1 \
>            --conf spark.driver.cores=1 \
>            --conf spark.executor.cores=1 \
>            --conf spark.executor.memory=2000m \
>            --conf spark.kubernetes.driver.docker.image=${IMAGEGCP} \
>            --conf spark.kubernetes.executor.docker.image=${IMAGEGCP} \
>            --conf spark.kubernetes.container.image=${IMAGEGCP} \
>            --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
>            --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
>            --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
>            --conf spark.sql.execution.arrow.pyspark.enabled="true" \
>            $CODE_DIRECTORY/${APPLICATION}
>
> Aren't the driver and executors running on the k8s cluster? So why is the
> launch node heavily used while the k8s cluster is underutilised?
>
> Thanks
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
