spark-issues mailing list archives

From "Lekshmi Nair (Jira)" <>
Subject [jira] [Created] (SPARK-32141) Repartition leads to out of memory
Date Wed, 01 Jul 2020 00:30:00 GMT
Lekshmi Nair created SPARK-32141:

             Summary: Repartition leads to out of memory
                 Key: SPARK-32141
             Project: Spark
          Issue Type: Bug
          Components: EC2
    Affects Versions: 2.4.4
            Reporter: Lekshmi Nair

We have an application that aggregates on 7 columns. To avoid shuffles, we thought of repartitioning on those 7 columns. It works well with 1 to 4 TB of data; once it gets over 4 TB, it fails with OOM or runs out of disk space.


Is there a better approach to reduce the shuffle? For our biggest dataset, the Spark job never completed with repartition. We are out of options.


We have a 24-node cluster of r5.24xlarge machines with 1 TB of disk each. Our shuffle partition count is set to 6912.
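For context, a rough back-of-envelope check of per-partition size under the reported settings. This is a sketch, not from the original report: it assumes roughly 4 TB of shuffle data spread evenly over the 6912 configured partitions, and real sizes will vary with compression and skew.

```python
# Rough shuffle-partition sizing check.
# Assumptions (not in the original report): ~4 TB of shuffle data,
# evenly distributed across the reported 6912 shuffle partitions.
TIB = 1024 ** 4
MIB = 1024 ** 2

def mib_per_partition(total_bytes: int, num_partitions: int) -> float:
    """Average bytes per shuffle partition, in MiB."""
    return total_bytes / num_partitions / MIB

size = mib_per_partition(4 * TIB, 6912)
# Roughly 607 MiB per partition on average.

# A common rule of thumb targets on the order of 128-200 MiB per
# partition; at ~128 MiB each, ~4 TB would need far more partitions:
needed = (4 * TIB) // (128 * MIB)  # 32768 partitions
```

By this estimate, each task at 4+ TB carries several times the commonly recommended partition size, which is consistent with executor OOM and heavy disk spill; raising `spark.sql.shuffle.partitions` (or the repartition count) as data volume grows is one thing worth checking.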

This message was sent by Atlassian Jira
