After a Spark SQL job appending a few columns using window aggregation
functions, and performing a join and some data massaging, I find that the
cleanup after the job finishes saving the result data to disk takes as long
if not longer than the job.
I currently am performing window aggregation on a dataset ~150 GB and
joining with another dataset of about ~50 GB.
With window aggregation, it takes about 15 minutes. Without window
aggregation and instead performing a standard groupBy(..).agg(...) and
join, it takes about 19 minutes.
However, when using window aggregation functions, for more than 15-20
minutes, the driver program is removing broadcast pieces, cleaning
accumulators, and cleaning shuffles.
Can anyone explain what these are at a lower level besides what I see on
the command line, or why this happens ONLY when I use window aggregation?
And are there any ways to remedy this?
Thank you!
Jestin
|