I'm going to try that soon, probably by setting spark.sql.shuffle.partitions to 2001. Also, would it help to repartition the data by the fields I use in the group by and window operations?

Without looking at the Spark UI and the stages/DAG, I'm guessing you're running on the default number of Spark shuffle partitions (spark.sql.shuffle.partitions defaults to 200).

If you're seeing a lot of shuffle spill, you likely have to increase the number of shuffle partitions to accommodate the huge shuffle size.
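
To put a rough number on "increase", here is a minimal sketch of the usual back-of-the-envelope sizing. The 128 MiB target per shuffle partition is an assumed rule of thumb, not something measured on this job, and the 900 GiB figure is the input size from the question, used here as a stand-in for the real shuffle size (which the Spark UI would show). The function name is made up for illustration.

```python
import math


def suggest_shuffle_partitions(shuffle_gib, target_partition_mib=128, total_cores=60):
    """Estimate a value for spark.sql.shuffle.partitions.

    shuffle_gib: estimated shuffle size in GiB (assumption: roughly the input size).
    target_partition_mib: assumed target size per shuffle partition.
    total_cores: executor cores across the cluster (10 executors * 6 cores in the question).
    """
    raw = math.ceil(shuffle_gib * 1024 / target_partition_mib)
    # Round up to a multiple of the core count so the last wave of tasks is full.
    return math.ceil(raw / total_cores) * total_cores


print(suggest_shuffle_partitions(900))       # 7200 with the defaults above
print(suggest_shuffle_partitions(900, 200))  # larger partitions, fewer of them
```

Treat the result as a starting point and adjust it based on the spill and task durations you actually see.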

Nope, it's a batch job. 

Is it a streaming job?

I have a Spark job that consists of a large number of Window operations and hence involves large shuffles. I have roughly 900 GiB of data and am running on what should be a large enough cluster (10 m5.4xlarge instances). I am using the following configuration for the job; I have tried various other combinations without success.

spark.yarn.driver.memoryOverhead 6g 0.1
spark.executor.cores 6
spark.executor.memory 36g
spark.memory.offHeap.size 8g
spark.memory.offHeap.enabled true
spark.executor.instances 10
spark.driver.memory 14g
spark.yarn.executor.memoryOverhead 10g
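
As a rough sanity check on these numbers, the sketch below estimates how much execution memory a single task can claim. It assumes Spark's unified memory model with the default spark.memory.fraction of 0.6 and 300 MiB of reserved heap, and that each of N concurrently running tasks gets between 1/(2N) and 1/N of the execution pool. The function names are made up for illustration, and the results are upper-bound estimates (nothing cached), not measurements from this job.

```python
MIB_PER_GIB = 1024


def unified_pool_mib(heap_gib, memory_fraction=0.6, reserved_mib=300):
    """On-heap memory shared by execution and storage, in MiB (assumed defaults)."""
    return (heap_gib * MIB_PER_GIB - reserved_mib) * memory_fraction


def per_task_bounds_mib(pool_mib, concurrent_tasks):
    """(min, max) share of a pool one task can hold with N tasks running at once."""
    return pool_mib / (2 * concurrent_tasks), pool_mib / concurrent_tasks


on_heap = unified_pool_mib(36)   # spark.executor.memory 36g
off_heap = 8 * MIB_PER_GIB       # spark.memory.offHeap.size 8g

for name, pool in (("on-heap", on_heap), ("off-heap", off_heap)):
    low, high = per_task_bounds_mib(pool, 6)  # spark.executor.cores 6
    print(f"{name}: pool {pool:.0f} MiB, per task {low:.0f} to {high:.0f} MiB")
```

Under these assumptions, each task has at most a few GiB to sort in, so the data per shuffle partition needs to stay well below that.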

I keep running into the following OOM error:

org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 16384 bytes of memory, got 0
at org.apache.spark.memory.MemoryConsumer.throwOom(
at org.apache.spark.memory.MemoryConsumer.allocateArray(
at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.<init>(
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(

I see that there are many JIRA tickets for similar issues, and a great many of them are even marked resolved.
Can someone guide me on how to approach this problem? I am using Databricks Spark 2.4.1.

