spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: [spark sql performance] Only 1 executor to write output?
Date Sun, 24 Mar 2019 19:07:03 GMT
Hi,

I think you are running into data skew: most of the rows probably share one
join key, so a single task ends up with nearly all of the shuffle data.
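
To illustrate what I mean, here is a minimal sketch in plain Python (no Spark
needed). It assumes a hypothetical join-key column in which one value
("UNKNOWN") dominates, and it uses CRC32 as a stand-in for Spark's own hash
partitioner, so the exact partition numbers will not match a real job. The
key names, row counts, and SALT_BUCKETS value are made up for the example;
200 is Spark's default for spark.sql.shuffle.partitions.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 200  # Spark's default for spark.sql.shuffle.partitions


def partition_of(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in for Spark's hash partitioning (Spark uses Murmur3, not CRC32).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def rows_per_partition(keys):
    # Count how many rows each shuffle partition (i.e. each task) receives.
    return Counter(partition_of(k) for k in keys)


# Hypothetical join keys: one hot value plus many distinct values.
keys = ["UNKNOWN"] * 100_000 + [f"cust-{i}" for i in range(10_000)]
plain = rows_per_partition(keys)

# Salting: append a small suffix so the hot key is spread over several
# partitions instead of landing in a single one.
SALT_BUCKETS = 50  # assumed value, tune for the real data
salted_keys = [f"{k}#{i % SALT_BUCKETS}" for i, k in enumerate(keys)]
salted = rows_per_partition(salted_keys)

print("largest partition, plain :", max(plain.values()))
print("largest partition, salted:", max(salted.values()))
```

Without salting, every "UNKNOWN" row hashes to the same partition, so one
task does almost all the work while the other 199 finish quickly. In a real
join the other side would also have to be replicated once per salt value so
that the salted keys still match.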


Regards,
Gourav

On Sun, Mar 24, 2019 at 3:50 PM Mike Chan <mikechancs@gmail.com> wrote:

> Dear all,
>
> I have a Spark SQL job that used to run in under 10 minutes and now takes
> 3 hours after a cluster migration, and I need to dig into what it is
> actually doing. I'm new to Spark, so please bear with me if I'm asking
> something unrelated.
>
> Env: Azure HDInsight, Spark 2.4, on Azure Storage
> SQL: Read and join some data, and finally write the result to a Hive metastore
>
> Application Behavior:
> Within the first 15 minutes, it loads the data and completes most tasks
> (199/200), leaving only 1 executor alive that continually shuffle-reads and
> writes data. Because only 1 executor is left, we have to wait 3 hours for
> the application to finish.
> [image: image.png]
>
> Only 1 executor left alive:
> [image: image.png]
>
> Not sure what the executor is doing:
> [image: image.png]
>
> From time to time, we can see the shuffle read increase:
> [image: image.png]
>
> Therefore I increased spark.executor.memory to 20g, but nothing
> changed. From Ambari and YARN I can tell the cluster has plenty of
> resources left.
> [image: image.png]
>
> Almost all executors released:
> [image: image.png]
>
> The spark.sql statement ends with the code below:
> .write.mode("overwrite").saveAsTable("default.mikemiketable")
>
> As you can tell, I'm new to Spark and don't fully understand what's going
> on here. The huge shuffle spill looks ugly, but it is probably not the
> cause of the slow execution; the cause is that only 1 executor is doing the
> work. I would greatly appreciate any pointers on how to troubleshoot or
> look further into it. Thank you very much.
>
> Best Regards,
> Mike
>
