spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)"
Subject [jira] [Created] (SPARK-15419) monotonicallyIncreasingId should use less memory with multiple partitions
Date Thu, 19 May 2016 21:24:12 GMT
Joseph K. Bradley created SPARK-15419:

             Summary: monotonicallyIncreasingId should use less memory with multiple partitions
                 Key: SPARK-15419
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.0
         Environment: branch-2.0, 1 worker
            Reporter: Joseph K. Bradley

When monotonicallyIncreasingId is used on a DataFrame with many partitions, it uses a very
large amount of memory.

Consider this code:
import org.apache.spark.sql.functions._

// JMAP1: run jmap -histo:live [PID]

val numPartitions = 1000
val df = spark.range(0, 1000000, 1, numPartitions).toDF("vtx")

// JMAP2: run jmap -histo:live [PID]

val df2 = df.withColumn("id", monotonicallyIncreasingId())

// JMAP3: run jmap -histo:live [PID]

Here's how memory usage progresses:
* JMAP1: This is just for calibration.
* JMAP2: No significant change from JMAP1.
* JMAP3: Massive jump in instance counts: 3048895 Longs, 1039638 Objects, 2007427 Integers, 1002000 org.apache.spark.sql.catalyst.expressions.GenericInternalRow
** None of these classes had a significant number of instances in JMAP1 or JMAP2.
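
To make the comparison between snapshots less manual, something like the following could diff two saved jmap histograms. This is only a rough sketch, not part of the original reproduction: the JmapDiff object and its parse and growth helpers are hypothetical names, and it assumes the usual "rank: #instances #bytes class name" row layout of jmap -histo output.

```scala
object JmapDiff {
  // One data row of `jmap -histo` output: "<rank>: <#instances> <#bytes> <class name>".
  // Any trailing text (e.g. a module name on newer JDKs) is ignored.
  private val Row = """^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S+).*$""".r

  // Parse histogram text into a map of class name -> live instance count.
  // Header, separator, and total lines do not match Row and are skipped.
  def parse(histo: String): Map[String, Long] =
    histo.split("\r?\n").collect {
      case Row(instances, _, className) => className -> instances.toLong
    }.toMap

  // Classes whose instance count grew by at least minDelta between the two
  // snapshots, largest growth first. Classes absent from `before` count as 0.
  def growth(
      before: Map[String, Long],
      after: Map[String, Long],
      minDelta: Long): Seq[(String, Long)] =
    after.toSeq
      .map { case (cls, count) => cls -> (count - before.getOrElse(cls, 0L)) }
      .filter { case (_, delta) => delta >= minDelta }
      .sortBy { case (_, delta) => -delta }
}
```

With the JMAP2 and JMAP3 outputs read into strings, JmapDiff.growth(JmapDiff.parse(jmap2Text), JmapDiff.parse(jmap3Text), 100000L) would list only the classes that gained at least 100000 instances (jmap2Text, jmap3Text, and the threshold are placeholders).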

When the indexed DataFrame is used repeatedly afterwards, the driver memory usage keeps increasing
and eventually blows up in my application.

Presumably this memory usage could be reduced.
