spark-issues mailing list archives

From Zoltán Zvara (JIRA) <j...@apache.org>
Subject [jira] [Commented] (SPARK-14652) pyspark streaming driver unable to cleanup metadata for cached RDDs leading to driver OOM
Date Fri, 16 Mar 2018 17:31:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-14652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16402230#comment-16402230 ]

Zoltán Zvara commented on SPARK-14652:
--------------------------------------

Reproduced the problem in Spark 2.3.0:
 * In an online ML pipeline, I annotate documents arriving from a stream using 8 different models.
 * Most of the application logic is incorporated into one {{foreachRDD}}.
 * Calling {{unpersist}} at several points of execution is not possible, since no action is called until the end of the output operator.
 * References to persisted RDDs are lost in the application code - the persisted RDDs are not collected anywhere, so they cannot be unpersisted at the end of the output operator (a sketch of the pattern follows this list).
 * Spark fails to clean up cached RDDs, leading to a GC overhead limit exceeded exception on the driver.
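
A minimal sketch of that pattern (the stream source, the trivial {{annotate}} stand-in, and the final {{count()}} are illustrative placeholders, not the real pipeline; only the persist/{{foreachRDD}} structure matters):

{code:python}
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="annotation-pipeline-sketch")
ssc = StreamingContext(sc, batchDuration=10)

def annotate(doc, model_id):
    # Stand-in for the per-model annotation step.
    return "%s|model-%d" % (doc, model_id)

def process(rdd):
    docs = rdd.persist()  # cached, but this local reference dies when process() returns
    for model_id in range(8):
        # Default argument pins model_id so each closure uses its own model.
        docs = docs.map(lambda d, m=model_id: annotate(d, m)).persist()
    docs.count()  # the single action at the end of the output operator
    # No unpersist() here: references to the intermediate persisted RDDs are
    # already lost, so the driver's storage metadata for them keeps accumulating.

ssc.socketTextStream("localhost", 9999).foreachRDD(process)
ssc.start()
ssc.awaitTermination()
{code}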

One workaround:
 * Load all models and static RDDs before the {{StreamingContext}} is started.
 * At the first micro-batch, add the currently persisted RDDs to a whitelist.
 * Using a {{StreamingListener}}, periodically clear old RDDs that are not on the whitelist (but keep some of the most recent RDDs as well, just to be safe); a sketch follows this list.
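
A rough sketch of this workaround, assuming Spark 2.3 PySpark. It reaches through the internal {{_jsc}} handle because PySpark exposes no public {{getPersistentRDDs()}}, and it treats the highest RDD ids as the most recent ones, so take it as illustrative rather than a supported recipe:

{code:python}
from pyspark.streaming.listener import StreamingListener

class CachedRddReaper(StreamingListener):
    """Unpersists cached RDDs that are neither whitelisted nor recent."""

    def __init__(self, sc, keep_recent=5):
        self.sc = sc
        self.keep_recent = keep_recent
        self.whitelist = set()
        self.batches = 0

    def onBatchCompleted(self, batchCompleted):
        # Internal handle: JavaSparkContext.getPersistentRDDs() -> java.util.Map[Integer, JavaRDD]
        persistent = self.sc._jsc.getPersistentRDDs()
        if self.batches == 0:
            # First micro-batch: whitelist the models and static RDDs loaded before start.
            self.whitelist = set(persistent.keys())
        else:
            # Heuristic: treat the highest RDD ids as the most recent ones and keep them.
            recent = set(sorted(persistent.keys())[-self.keep_recent:])
            for rdd_id, java_rdd in persistent.items():
                if rdd_id not in self.whitelist and rdd_id not in recent:
                    java_rdd.unpersist()
        self.batches += 1
{code}

Register it before the context is started, e.g. {{ssc.addStreamingListener(CachedRddReaper(sc))}}.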

> pyspark streaming driver unable to cleanup metadata for cached RDDs leading to driver OOM
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-14652
>                 URL: https://issues.apache.org/jira/browse/SPARK-14652
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams, PySpark
>    Affects Versions: 1.6.1
>         Environment: pyspark 1.6.1
> python 2.7.6
> Ubuntu 14.04.2 LTS
> Oracle JDK 1.8.0_77
>            Reporter: Wei Deng
>            Priority: Major
>
> ContextCleaner was introduced in SPARK-1103 and according to its PR [here|https://github.com/apache/spark/pull/126]:
> {quote}
> RDD cleanup:
> {{ContextCleaner}} calls {{RDD.unpersist()}} to clean up persisted RDDs. Regarding metadata, the DAGScheduler automatically cleans up all metadata related to an RDD after all jobs have completed. Only the {{SparkContext.persistentRDDs}} map keeps strong references to persisted RDDs. The {{TimeStampedHashMap}} used for that has been replaced by {{TimeStampedWeakValueHashMap}}, which keeps only weak references to the RDDs, allowing them to be garbage collected.
> {quote}
> However, we have observed that this is not the case for cached RDDs in pyspark streaming code with the current latest Spark 1.6.1 version. This is reflected in the ever-growing number of RDDs in the {{Storage}} tab of the Spark Streaming application's UI once a pyspark streaming job starts to run. We used the [writemetrics.py|https://github.com/weideng1/energyiot/blob/f74d3a8b5b01639e6ff53ac461b87bb8a7b1976f/analytics/writemetrics.py] code to reproduce the problem; every time, after running for 20+ hours, the driver's JVM starts to show signs of OOM: the old generation fills up, the JVM gets stuck in full GC cycles without freeing any old-generation space, and eventually the driver crashes with OOM.
> We have collected a heap dump right before the OOM happened and can make it available for analysis if it's considered useful. However, it might be easier to just monitor the growth of the number of RDDs in the {{Storage}} tab of the Spark application's UI to confirm this is happening. To illustrate the problem, we also set {{--conf spark.cleaner.periodicGC.interval=10s}} on the spark-submit command line of the pyspark code and enabled DEBUG-level logging in the driver's logback.xml, and confirmed that even when the cleaner is triggered as often as every 10 seconds, none of the cached RDDs is unpersisted automatically by ContextCleaner.
> Currently we have resorted to manually calling unpersist() to work around the problem. However, this goes against the spirit of SPARK-1103, i.e. automated garbage collection in the SparkContext.
> We also conducted a simple test with Scala code, again setting {{--conf spark.cleaner.periodicGC.interval=10s}}, and found that the Scala code was able to clean up the RDDs every 10 seconds as expected, so this appears to be a pyspark-specific issue. We suspect it has something to do with Python not being able to pass those out-of-scope RDDs as weak references to the ContextCleaner.
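
For reference, a minimal sketch of the manual unpersist() workaround mentioned in the report above (the function and action are illustrative; the commented spark-submit line just restates the periodic-GC setting from the report):

{code:python}
def process(rdd):
    cached = rdd.persist()
    try:
        cached.count()        # the batch's output action
    finally:
        cached.unpersist()    # explicit cleanup, since ContextCleaner never triggers it

# stream.foreachRDD(process)
# spark-submit --conf spark.cleaner.periodicGC.interval=10s writemetrics.py
{code}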


