spark-user mailing list archives

From Holden Karau <hol...@pigscanfly.ca>
Subject Re: pyspark - memory leak leading to OOM after submitting 100 jobs?
Date Fri, 01 Nov 2019 11:08:41 GMT
On Thu, Oct 31, 2019 at 10:04 PM Nicolas Paris <nicolas.paris@riseup.net>
wrote:

> Have you deactivated the Spark UI?
> I have read several threads explaining that the UI can lead to OOM
> because it stores 1000 DAGs by default.
>
>
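
To make the UI suggestion above concrete, here is a minimal sketch of
lowering how much job history the driver retains. The config keys are
standard Spark settings; the values, the names LOW_RETENTION_CONF and
build_session, and the app name are illustrative assumptions, not tuned
recommendations. Setting spark.ui.enabled to "false" is the other option
mentioned above. These settings only take effect if they are set before
the SparkContext is created.

```python
# Sketch only: the values below are illustrative assumptions.
LOW_RETENTION_CONF = {
    "spark.ui.retainedJobs": "50",             # Spark default is 1000
    "spark.ui.retainedStages": "50",           # Spark default is 1000
    "spark.sql.ui.retainedExecutions": "50",   # Spark default is 1000
}


def build_session(conf=LOW_RETENTION_CONF, app_name="ui-retention-sketch"):
    """Create a local-mode session with the given config (hypothetical helper)."""
    # Imported lazily so the dict above can be inspected without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master("local[*]").appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    # getOrCreate() reuses an already running session, in which case these
    # settings may not be applied; call this before any session exists.
    return builder.getOrCreate()
```
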
> On Sun, Oct 20, 2019 at 03:18:20AM -0700, Paul Wais wrote:
> > Dear List,
> >
> > I've observed some sort of memory leak when using pyspark to run ~100
> > jobs in local mode.  Each job is essentially a create RDD -> create DF
> > -> write DF sort of flow.  The RDD and DFs go out of scope after each
> > job completes, hence I call this issue a "memory leak."  Here's
> > pseudocode:
> >
> > ```
> > row_rdds = []
> > for i in range(100):
> >   row_rdd = spark.sparkContext.parallelize(
> >       [{'a': j} for j in range(1000)])
> >   row_rdds.append(row_rdd)
> >
> > for row_rdd in row_rdds:
> >   df = spark.createDataFrame(row_rdd)
> >   df.persist()
> >   print(df.count())
> >   df.write.save(...) # Save parquet
> >   df.unpersist()
> >
> >   # Does not help:
> >   # del df
> >   # del row_rdd
> > ```

The connection between Python GC/del and JVM GC is perhaps a bit weaker
than we might like. There certainly could be a problem here, but it still
shouldn’t be getting to the OOM state.
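
As an experiment (not a confirmed fix), one could force both collectors
between jobs and see whether the growth changes. The sketch below is an
assumption-laden illustration: release_between_jobs is a hypothetical
helper name, and _jvm is a private py4j handle on the SparkContext rather
than public API. System.gc() is only a hint to the JVM.

```python
import gc


def release_between_jobs(spark, df=None):
    """Drop cached data, then nudge the Python and JVM collectors (sketch)."""
    if df is not None:
        # Wait until the cached blocks are actually removed.
        df.unpersist(blocking=True)
    # Remove anything else still registered in the SQL cache.
    spark.catalog.clearCache()
    # Collecting Python garbage lets py4j release its references to
    # JVM-side objects that are no longer reachable from Python.
    collected = gc.collect()
    # Private attribute; requests (does not guarantee) a JVM collection.
    spark.sparkContext._jvm.System.gc()
    return collected
```

Calling this at the end of each loop iteration, after the write, would
show whether the JVM's resident size still climbs once unreachable
objects have had a chance to be collected.
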

>
> >
> > In my real application:
> >  * rows are much larger, perhaps 1MB each
> >  * row_rdds are sized to fit available RAM
> >
> > I observe that after 100 or so iterations of the second loop (each of
> > which creates a "job" in the Spark WebUI), the following happens:
> >  * pyspark workers have fairly stable resident and virtual RAM usage
> >  * java process eventually approaches resident RAM cap (8GB standard)
> > but virtual RAM usage keeps ballooning.
> >

Can you share what flags the JVM is launching with? Also which JVM(s) are
ballooning?
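
If it helps, something like the following sketch would gather both
answers on Linux by reading /proc: the launch arguments of every java
process, plus its resident and virtual sizes. The function names are
hypothetical, and it assumes the JVM executable is literally named
"java".

```python
import os


def java_processes(proc_root="/proc"):
    """Return [(pid, args, vm_rss, vm_size)] for processes named java."""
    found = []
    if not os.path.isdir(proc_root):
        return found  # not Linux, or /proc is unavailable
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "cmdline"), "rb") as f:
                raw = f.read()
            with open(os.path.join(base, "status")) as f:
                status = dict(
                    line.split(":", 1) for line in f if ":" in line
                )
        except OSError:
            continue  # process exited or is not readable
        # cmdline is NUL-separated; kernel threads have an empty one.
        args = [a.decode(errors="replace") for a in raw.split(b"\0") if a]
        if not args or os.path.basename(args[0]) != "java":
            continue
        found.append((
            int(entry),
            args,
            status.get("VmRSS", "").strip(),
            status.get("VmSize", "").strip(),
        ))
    return found


def jvm_flags(args):
    """Keep only JVM options such as -Xmx8g, -XX:..., and -Dkey=value."""
    return [a for a in args[1:] if a.startswith(("-X", "-D"))]
```

Running java_processes() a few times during the loop and comparing the
VmRSS and VmSize values per pid would show which JVM is growing.
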

>
> > Eventually the machine runs out of RAM and the linux OOM killer kills
> > the java process, resulting in an "IndexError: pop from an empty
> > deque" error from py4j/java_gateway.py .
> >
> >
> > Does anybody have any ideas about what's going on?  Note that this is
> > local mode.  I have personally run standalone masters and submitted a
> > ton of jobs and never seen something like this over time.  Those were
> > very different jobs, but perhaps this issue is specific to local mode?
> >
> > Emphasis: I did try to del the pyspark objects and run python GC.
> > That didn't help at all.
> >
> > pyspark 2.4.4 on java 1.8 on ubuntu bionic (tensorflow docker image)
> >
> > 12-core i7 with 16GB of ram and 22GB swap file (swap is *on*).
> >
> > Cheers,
> > -Paul
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >
>
> --
> nicolas
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
