I have a spark sql that used to execute < 10 mins now running at 3 hours after a cluster migration and need to deep dive on what it's actually doing. I'm new to spark and please don't mind if I'm asking something unrelated.
Env: Azure HDinsight spark 2.4 on Azure storage
SQL: Read and Join some data and finally write result to a Hive metastore
Within the first 15 mins, it loads and complete most tasks (199/200); left only 1 executor process alive and continually to shuffle read / write data. Because now it only leave 1 executor, we need to wait 3 hours until this application finish.
Left only 1 executor alive
Not sure what's the executor doing:
From time to time, we can tell the shuffle read increased:
Therefore I increased the spark.executor.memory to 20g, but nothing changed. From Ambari and YARN I can tell the cluster has many resources left.
Release of almost all executor
The sparl.sql ends with below code:
As you can tell I'm new to spark and not 100% getting what's going on here. The huge shuffle spill looks ugly, but they probably not the reason of slow execution - the reason why only 1 executor doing the job it is. Greatly appreciate if you can share how to troubleshot / further look into it. Thank you very much.