spark-user mailing list archives

From "1650996069" <>
Subject Reading data slows down when Spark3.0 uses multiple cpu cores
Date Mon, 09 Nov 2020 02:29:06 GMT
Hello, I recently ran into a problem that confuses me when using Spark 3.0.

I used the TPCx-BB dataset (200GB) and executed Query#5 from it. The SQL reads about 65.7GB
of table data.

Query#5  is as  follows(


The execution script is:
#! /bin/bash
$Spark_Home/bin/spark-submit \
--class org.tpcxBB \
--master spark:// \
--executor-memory 40g \
--total-executor-cores 48 \
--executor-cores 8 \
--conf spark.task.cpus=1 \
--conf spark.driver.memory=40g \
--conf spark.driver.maxResultSize=60g \
--conf spark.executor.memory=40g \
--conf spark.sql.shuffle.partitions=100 \
/home/runJars/Tpcxbb-1.0.jar "/user/root/benchmarks/data-200g/data" "5"

There are three machines in the Spark cluster, each with 56 CPU cores and 360GB of memory.
What is strange is that when I increase the number of CPU cores, the overall execution
time of Query#5 goes down, as can be seen in the history server web UI, but the average
time of the tasks (510 in total) responsible for reading data increases significantly.
When cpu cores=32, the average task time is 7s
When cpu cores=48, the average task time is 17s
When cpu cores=96, the average task time is 20s
Looking at the task logs, I found that for reading the same data file, the task time
increases with the number of CPU cores.
What is the reason for this?
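
A rough back-of-the-envelope calculation may help frame the question. The sketch below uses only the figures quoted in this message (65.7GB read, 510 read tasks, and the three average task times) to estimate the data per task, the read rate each task achieves, and an upper bound on the aggregate read rate. It assumes that the data is split evenly across the 510 tasks and that every core runs one read task at a time (spark.task.cpus=1); neither assumption has been checked against the logs, and the function name summarize is only illustrative.

```python
import math

TOTAL_READ_GB = 65.7   # table data read by Query#5 (from the message)
NUM_TASKS = 510        # number of read tasks (from the message)

# (total CPU cores, average read-task time in seconds), as reported above
MEASUREMENTS = [(32, 7.0), (48, 17.0), (96, 20.0)]


def summarize(cores, avg_task_s):
    """Estimate per-task and aggregate read rates for one measurement.

    Assumes an even split of the data over all tasks and one running
    task per core, so the aggregate figure is an upper bound.
    """
    mb_per_task = TOTAL_READ_GB * 1024 / NUM_TASKS
    per_task_mb_s = mb_per_task / avg_task_s
    return {
        "cores": cores,
        "mb_per_task": mb_per_task,
        "per_task_mb_s": per_task_mb_s,
        "aggregate_mb_s": per_task_mb_s * cores,
        # Number of scheduling "waves" needed to run all tasks.
        "waves": math.ceil(NUM_TASKS / cores),
    }


if __name__ == "__main__":
    for cores, avg_task_s in MEASUREMENTS:
        r = summarize(cores, avg_task_s)
        print(
            f"cores={r['cores']:3d}  "
            f"per-task={r['per_task_mb_s']:6.1f} MB/s  "
            f"aggregate<={r['aggregate_mb_s']:7.1f} MB/s  "
            f"waves={r['waves']}"
        )
```

If the per-task rate falls as cores are added while the aggregate estimate does not grow in proportion, that would be consistent with the read tasks competing for a shared resource such as disk or network bandwidth, although these numbers alone do not prove it.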

Next, when I kept the total number of executors unchanged and still increased the CPU cores,
I found that the result was the same. The executor memory and driver memory are sufficient.
Can you explain the reason for this?
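
For clarity about what "keeping the executors unchanged" means here, the following sketch works out the executor layout implied by the two core flags in the script. It assumes a standalone cluster where the master grants every requested core and spreads executors evenly over the three workers; the helper name executor_layout and its parameters are hypothetical and not part of any Spark API.

```python
def executor_layout(total_executor_cores, executor_cores, task_cpus=1, workers=3):
    """Derive executor count and task slots from the spark-submit flags.

    Assumption: all requested cores are granted and executors are
    spread evenly across the workers.
    """
    if executor_cores <= 0 or task_cpus <= 0 or workers <= 0:
        raise ValueError("core counts and worker count must be positive")
    executors = total_executor_cores // executor_cores
    return {
        "executors": executors,
        # Tasks that can run at the same time across the whole application.
        "task_slots": executors * (executor_cores // task_cpus),
        "executors_per_worker": executors / workers,
    }


if __name__ == "__main__":
    # Values from the script: --total-executor-cores 48, --executor-cores 8
    print(executor_layout(48, 8))
    # Same number of executors with twice the cores per executor
    print(executor_layout(96, 16))
```

Under these assumptions, raising the total cores from 48 to 96 while keeping six executors means each executor runs twice as many read tasks concurrently.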
Thank you!
P.S. Excuse me, I sent this same email on Friday and it was reported as sent successfully.
Why can't I see the email I sent at http://apache-spark-developers-list.1001551.n3.nabble.