spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-15317) JobProgressListener takes a huge amount of memory with iterative DataFrame program in local, standalone
Date Sat, 14 May 2016 00:45:13 GMT

     [ https://issues.apache.org/jira/browse/SPARK-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-15317:
--------------------------------------
    Attachment: compare-2.0-10Kpartitions.png
                compare-2.0-16partitions.png
                compare-1.6-10Kpartitions.png

> JobProgressListener takes a huge amount of memory with iterative DataFrame program in local, standalone
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15317
>                 URL: https://issues.apache.org/jira/browse/SPARK-15317
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.0
>         Environment: Spark 2.0, local mode + standalone mode on MacBook Pro OSX 10.9
>            Reporter: Joseph K. Bradley
>         Attachments: cc_traces.txt, compare-1.6-10Kpartitions.png, compare-2.0-10Kpartitions.png, compare-2.0-16partitions.png, dump-standalone-2.0-1of4.png, dump-standalone-2.0-2of4.png, dump-standalone-2.0-3of4.png, dump-standalone-2.0-4of4.png
>
>
> h2. TL;DR
> Running a small test locally, I found JobProgressListener consuming a huge amount of memory.  The test does run many tasks, but the memory footprint is still surprising.  Summary, with details below:
> * Spark app: a series of iterative DataFrame joins
> * Issue: excessive GC
> * Heap dump shows JobProgressListener taking 150-400 MB, depending on the Spark mode/version
> h2. Reproducing this issue
> h3. With more complex code
> The code which fails:
> * Here is a branch with the code snippet which fails: [https://github.com/jkbradley/spark/tree/18836174ab190d94800cc247f5519f3148822dce]
> ** This is based on Spark commit hash: bb1362eb3b36b553dca246b95f59ba7fd8adcc8a
> * Look at {{CC.scala}}, which implements connected components using DataFrames: [https://github.com/jkbradley/spark/blob/18836174ab190d94800cc247f5519f3148822dce/mllib/src/main/scala/org/apache/spark/ml/CC.scala]
> In the spark shell, run:
> {code}
> import org.apache.spark.ml.CC
> import org.apache.spark.sql.SQLContext
> val sqlContext = SQLContext.getOrCreate(sc)
> CC.runTest(sqlContext)
> {code}
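> For orientation, {{CC.scala}} boils down to an iterative DataFrame join; here is a minimal sketch of that pattern (illustrative only, with a made-up toy graph, not the actual implementation):
> {code}
> // Minimal sketch of iterative DataFrame joins for connected components
> // (illustrative only). Each iteration launches at least one job, so
> // per-job/stage/task listener state accumulates across iterations.
> import org.apache.spark.sql.functions._
> // Toy edge list: a ring over 1000 vertices, as DataFrame(src, dst).
> val edges = sqlContext.range(0, 1000).selectExpr("id as src", "(id + 1) % 1000 as dst")
> // Start with every vertex in its own component.
> var components = sqlContext.range(0, 1000).select(col("id"), col("id").as("component"))
> for (iter <- 0 until 10) {
>   val propagated = components
>     .join(edges, components("id") === edges("src"))
>     .select(edges("dst").as("id"), components("component"))
>   components = propagated.groupBy("id").agg(min("component").as("component"))
>   components.cache().count()  // materialize; every count() is another job
> }
> {code}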
> I have attached a file {{cc_traces.txt}} with the stack traces from running {{runTest}}.  Note that I sometimes had to run {{runTest}} twice to trigger the fatal exception.  The file also includes a trace for 1.6; {{CC.scala}} should run on 1.6 without modifications.  These traces are from running in local mode.
> I used {{jmap}} to dump the heap (an example invocation follows this list):
> * local mode with 2.0: JobProgressListener took about 397 MB
> * standalone mode with 2.0: JobProgressListener took about 171 MB (see attached screenshots from MemoryAnalyzer)
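> Roughly the kind of invocation used, for anyone reproducing this (an illustrative example; the PID and file name are placeholders):
> {code}
> jps -lm                                         # find the driver JVM's PID
> jmap -dump:live,format=b,file=heap.hprof <pid>  # binary dump, openable in MemoryAnalyzer
> jmap -histo:live <pid> | head -n 20             # quick per-class histogram, no full dump
> {code}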
> Both 1.6 and 2.0 exhibit this issue.  2.0 ran faster, and the issue (JobProgressListener allocation) seems more severe with 2.0, though it could just be that 2.0 makes more progress and runs more jobs.
> h3. With simpler code
> I ran this with master (~Spark 2.0):
> {code}
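> // range(start, end, step, numPartitions): 10,000 single-element partitions, so count() launches 10,000 tasks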
> val data = spark.range(0, 10000, 1, 10000)
> data.cache().count()
> {code}
> The resulting heap dump:
> * 78MB for {{scala.tools.nsc.interpreter.ILoop$ILoopInterpreter}}
> * 58MB for {{org.apache.spark.ui.jobs.JobProgressListener}}
> * 80MB for {{io.netty.buffer.PoolChunk}}
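> One knob that might bound this (an untested assumption on my part, not a confirmed fix): the UI retention limits {{spark.ui.retainedJobs}} and {{spark.ui.retainedStages}} (both default 1000), which cap how many completed jobs and stages the listener keeps.  A sketch for comparing heap dumps with lower limits:
> {code}
> // Sketch: rerun the simple repro with lower UI retention limits and compare
> // heap dumps. (Assumption: listener growth is dominated by retained
> // job/stage/task data; the config keys exist, but the effect here is untested.)
> import org.apache.spark.sql.SparkSession
> val spark = SparkSession.builder()
>   .master("local[*]")
>   .appName("listener-retention-test")
>   .config("spark.ui.retainedJobs", "100")
>   .config("spark.ui.retainedStages", "100")
>   .getOrCreate()
> val data = spark.range(0, 10000, 1, 10000)
> data.cache().count()
> {code}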



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
