spark-issues mailing list archives

From "Brad Willard (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-5075) Memory Leak when repartitioning SchemaRDD or running queries in general
Date Fri, 09 Jan 2015 18:55:35 GMT

    [ https://issues.apache.org/jira/browse/SPARK-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271706#comment-14271706 ]

Brad Willard commented on SPARK-5075:
-------------------------------------

I wanted to add that this is greatly exacerbated by datasets with nested structure. If a
JSON object contains a nested JSON object, and your queries touch attributes of the nested
object, the memory explosion is markedly worse.
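
For concreteness, a minimal sketch of the kind of query I mean (the bucket path and the
nested "user" field are hypothetical, not from this dataset; assuming the Spark 1.2 PySpark API):

    from pyspark.sql import SQLContext

    sql_context = SQLContext(sc)

    # Records carry a nested object, e.g. {"id": 1, "user": {"id": 42, "name": "..."}}
    rdd = sql_context.jsonFile('s3n://bucket/events')
    rdd.registerTempTable('events')

    # Queries that touch attributes of the nested object (user.id here)
    # are the ones that correlate with the memory explosion.
    sql_context.sql('SELECT user.id, COUNT(*) AS cnt FROM events GROUP BY user.id').collect()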

I'm tempted to say Spark should not let you use nested JSON objects. It murders performance
everywhere.
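
If you control the pipeline, one workaround (again a sketch under the same hypothetical
schema, not something I've verified against this bug) is to project the nested attributes
you actually query into flat top-level columns once, and run everything against the flat table:

    # Flatten the nested attributes into top-level columns,
    # then register and persist the flat table instead.
    flat = sql_context.sql('SELECT id, user.id AS user_id, user.name AS user_name FROM events')
    flat.registerTempTable('events_flat')
    flat.saveAsParquetFile('hdfs://some_path/events_flat')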

> Memory Leak when repartitioning SchemaRDD or running queries in general
> -----------------------------------------------------------------------
>
>                 Key: SPARK-5075
>                 URL: https://issues.apache.org/jira/browse/SPARK-5075
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 1.2.0
>         Environment: spark-ec2 launched 10 node cluster of type c3.8xlarge
>            Reporter: Brad Willard
>              Labels: ec2, json, memory-leak, memory_leak, parquet, pyspark, repartition, s3
>
> I'm trying to repartition a JSON dataset for better CPU utilization and save it in Parquet
> format for better performance. The JSON dataset is about 200 GB:
> from pyspark.sql import SQLContext
> sql_context = SQLContext(sc)
> # Load the ~200 GB JSON dataset from S3
> rdd = sql_context.jsonFile('s3c://some_path')
> # Repartition for better parallelism, then save as Parquet
> rdd = rdd.repartition(256)
> rdd.saveAsParquetFile('hdfs://some_path')
> In Ganglia, when the dataset first loads it occupies about 200 GB in memory, which is expected.
> However, once the repartition starts, memory usage balloons to over 2.5x that and is never
> released, making any subsequent operations fail with memory errors.
> https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png
> I'm also seeing similar memory-leak behavior when running repeated queries on a dataset.
> rdd = sql_context.parquetFile('hdfs://some_path')
> rdd.registerTempTable('events')
> sql_context.sql("<any query>")
> sql_context.sql("<any query>")
> sql_context.sql("<any query>")
> sql_context.sql("<any query>")
> will result in a memory usage pattern like this:
> http://cl.ly/image/180y2D3d1A0X
> It seems like intermediate results are not being garbage collected. Eventually I have to
> kill my session to keep running queries.
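
For reference, the repeated-query pattern quoted above in concrete form (the SQL text is
only a placeholder; assuming the Spark 1.2 PySpark API):

    rdd = sql_context.parquetFile('hdfs://some_path')
    rdd.registerTempTable('events')

    # Any repeated queries reportedly reproduce the growth;
    # the query text itself is a placeholder.
    for _ in range(4):
        sql_context.sql('SELECT COUNT(*) FROM events').collect()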





