spark-issues mailing list archives

From "Lekshmi Nair (Jira)" <j...@apache.org>
Subject [jira] [Created] (SPARK-32141) Repartition leads to out of memory
Date Wed, 01 Jul 2020 00:30:00 GMT
Lekshmi Nair created SPARK-32141:
------------------------------------

             Summary: Repartition leads to out of memory
                 Key: SPARK-32141
                 URL: https://issues.apache.org/jira/browse/SPARK-32141
             Project: Spark
          Issue Type: Bug
          Components: EC2
    Affects Versions: 2.4.4
            Reporter: Lekshmi Nair


We have an application that aggregates on 7 columns. To avoid shuffles, we repartition on those 7 columns before aggregating. This works well with 1 to 4 TB of data; once the input exceeds 4 TB, the job fails with OOM or runs out of disk space.
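
For reference, a minimal sketch in Scala of the pattern described above. The column names (c1 through c7), the input path, and the count aggregate are placeholders rather than details from the report:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, count}

    val spark = SparkSession.builder().appName("repartition-aggregate").getOrCreate()

    // Placeholder input path; the real dataset is 1-4+ TB.
    val df = spark.read.parquet("s3://bucket/input/")

    // Hash-partition on the grouping columns so the groupBy below can reuse
    // this partitioning instead of adding a second exchange. The number of
    // partitions produced defaults to spark.sql.shuffle.partitions.
    val repartitioned = df.repartition(
      col("c1"), col("c2"), col("c3"), col("c4"),
      col("c5"), col("c6"), col("c7"))

    val aggregated = repartitioned
      .groupBy("c1", "c2", "c3", "c4", "c5", "c6", "c7")
      .agg(count("*").as("cnt"))  // placeholder aggregate
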
Do we have a better approach to reduce the shuffle? For our biggest dataset, the Spark job has never run to completion with repartition. We are out of options.

We have a 24-node cluster of r5.24xlarge machines with 1 TB of disk. Our shuffle partition count is set to 6912.
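
Assuming the "shuffle partition" setting refers to spark.sql.shuffle.partitions (the report does not name the key), a sketch of setting it on a live session; the same key can also be passed with --conf at submit time:

    // Runtime SQL configuration; 6912 is the value quoted above.
    spark.conf.set("spark.sql.shuffle.partitions", "6912")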



