[ https://issues.apache.org/jira/browse/SPARK-29321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
George Papa updated SPARK-29321:
--------------------------------
Description:
This issue is a clone of SPARK-29055. {color:#172b4d}After Spark version 2.3.3, I observe
that the JVM memory is increasing slightly over time. This behavior also affects application
performance: when I run my real application in a testing environment, after a while the
persisted dataframes stop fitting into the executors' memory and they spill to disk.{color}
{color:#172b4d}JVM memory usage (based on the htop command){color}
||Time||RES||SHR||MEM%||
|1min|{color:#de350b}1349{color}|32724|1.5|
|3min|{color:#de350b}1936{color}|32724|2.2|
|5min|{color:#de350b}2506{color}|32724|2.6|
|7min|{color:#de350b}2564{color}|32724|2.7|
|9min|{color:#de350b}2584{color}|32724|2.7|
|11min|{color:#de350b}2585{color}|32724|2.7|
|13min|{color:#de350b}2592{color}|32724|2.7|
|15min|{color:#de350b}2591{color}|32724|2.7|
|17min|{color:#de350b}2591{color}|32724|2.7|
|30min|{color:#de350b}2600{color}|32724|2.7|
|1h|{color:#de350b}2618{color}|32724|2.7|
*HOW TO REPRODUCE THIS BEHAVIOR:*
Reproduce the above behavior by running the code snippet below (I prefer to run it without any
sleep delay) and tracking the JVM memory with the top or htop command.
{code:python}
import os
import time

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Directory that contains the CSV files (keep a trailing "/" because the
# file name is concatenated directly onto it below).
target_dir = "..."

spark = SparkSession.builder.appName("DataframeCount").getOrCreate()

# Repeatedly read every CSV file in the directory and count its records.
while True:
    for f in os.listdir(target_dir):
        df = spark.read.load(target_dir + f, format="csv")
        print("Number of records: {0}".format(df.count()))
        time.sleep(15)
{code}
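The RES numbers in the table above were read manually from htop. For reference, a rough sketch of how
similar samples could be collected programmatically is shown below; the psutil dependency and the
log_jvm_rss helper are illustrative additions, not part of the original reproduction.
{code:python}
import time

import psutil  # assumption: psutil is installed separately (pip install psutil)


def log_jvm_rss(jvm_pid, interval_seconds=60):
    """Illustrative helper: periodically print the resident set size (RES)
    of the given JVM process, similar to watching the RES column in htop."""
    proc = psutil.Process(jvm_pid)
    while True:
        rss_mb = proc.memory_info().rss / (1024.0 * 1024.0)
        print("RES of pid {0}: {1:.0f} MB".format(jvm_pid, rss_mb))
        time.sleep(interval_seconds)


# Example usage (find the PID of the driver/executor JVM with jps or top first):
# log_jvm_rss(12345)
{code}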
*TESTED CASES WITH THE SAME BEHAVIOR:*
* Default settings (spark-defaults.conf)
* Setting {{spark.cleaner.periodicGC.interval}} to 1min (or less); one way these configuration settings can be applied is sketched after this list
* Setting {{spark.cleaner.referenceTracking.blocking}}=false
* Running the application in cluster mode
* Increasing/decreasing the resources of the executors and the driver
* Setting extraJavaOptions on the driver and executors: -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12
* Spark 2.4.4 (latest), which shows the same behavior
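The configuration keys above are standard Spark settings. As a point of reference, the following is a
minimal sketch (not the exact setup used) of how they could be applied programmatically to the
reproduction snippet; note that driver-side extraJavaOptions normally have to be supplied through
spark-submit or spark-defaults.conf rather than in code.
{code:python}
from pyspark.sql import SparkSession

# Illustrative only: the reproduction loop can be started with the cleaner-related
# settings from the list above applied programmatically instead of via spark-defaults.conf.
spark = (
    SparkSession.builder
    .appName("DataframeCount")
    # Trigger the ContextCleaner's periodic GC every minute instead of the default 30min.
    .config("spark.cleaner.periodicGC.interval", "1min")
    # Do not block cleanup tasks on the cleaner thread.
    .config("spark.cleaner.referenceTracking.blocking", "false")
    # GC flags that were tested; for the driver these only take effect when passed
    # before the JVM starts (e.g. via spark-submit), shown here for illustration.
    .config("spark.driver.extraJavaOptions",
            "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12")
    .config("spark.executor.extraJavaOptions",
            "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12")
    .getOrCreate()
)
{code}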
*DEPENDENCIES*
* Operating system: Ubuntu 16.04.3 LTS
* Java: jdk1.8.0_131 (tested also with jdk1.8.0_221)
* Python: Python 2.7.12
was:
This issue is a clone of SPARK-29055. {color:#172b4d}After Spark version 2.3.3, I observe
that the JVM memory is increasing slightly over time. This behavior also affects application
performance: when I run my real application in a testing environment, after a while the
persisted dataframes stop fitting into the executors' memory and they spill to disk.{color}
*HOW TO REPRODUCE THIS BEHAVIOR:*
Reproduce the above behavior by running the code snippet below (I prefer to run it without any
sleep delay) and tracking the JVM memory with the top or htop command.
{code:python}
import time
import os
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

target_dir = "..."
spark = SparkSession.builder.appName("DataframeCount").getOrCreate()

while True:
    for f in os.listdir(target_dir):
        df = spark.read.load(target_dir + f, format="csv")
        print("Number of records: {0}".format(df.count()))
        time.sleep(15)
{code}
*TESTED CASES WITH THE SAME BEHAVIOR:*
* Default settings (spark-defaults.conf)
* Setting {{spark.cleaner.periodicGC.interval}} to 1min (or less)
* Setting {{spark.cleaner.referenceTracking.blocking}}=false
* Running the application in cluster mode
* Increasing/decreasing the resources of the executors and the driver
* Setting extraJavaOptions on the driver and executors: -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12
* Spark 2.4.4 (latest), which shows the same behavior
*DEPENDENCIES*
* Operating system: Ubuntu 16.04.3 LTS
* Java: jdk1.8.0_131 (tested also with jdk1.8.0_221)
* Python: Python 2.7.12
> Possible memory leak in Spark
> -----------------------------
>
> Key: SPARK-29321
> URL: https://issues.apache.org/jira/browse/SPARK-29321
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.3
> Reporter: George Papa
> Priority: Major
>
> This issue is a clone of SPARK-29055. {color:#172b4d}After Spark version 2.3.3, I observe
> that the JVM memory is increasing slightly over time. This behavior also affects application
> performance: when I run my real application in a testing environment, after a while the
> persisted dataframes stop fitting into the executors' memory and they spill to disk.{color}
> {color:#172b4d}JVM memory usage (based on the htop command){color}
> ||Time||RES||SHR||MEM%||
> |1min|{color:#de350b}1349{color}|32724|1.5|
> |3min|{color:#de350b}1936{color}|32724|2.2|
> |5min|{color:#de350b}2506{color}|32724|2.6|
> |7min|{color:#de350b}2564{color}|32724|2.7|
> |9min|{color:#de350b}2584{color}|32724|2.7|
> |11min|{color:#de350b}2585{color}|32724|2.7|
> |13min|{color:#de350b}2592{color}|32724|2.7|
> |15min|{color:#de350b}2591{color}|32724|2.7|
> |17min|{color:#de350b}2591{color}|32724|2.7|
> |30min|{color:#de350b}2600{color}|32724|2.7|
> |1h|{color:#de350b}2618{color}|32724|2.7|
>
> *HOW TO REPRODUCE THIS BEHAVIOR:*
> Reproduce the above behavior by running the code snippet below (I prefer to run it without any
> sleep delay) and tracking the JVM memory with the top or htop command.
> {code:python}
> import time
> import os
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> from pyspark.sql import types as T
>
> target_dir = "..."
> spark = SparkSession.builder.appName("DataframeCount").getOrCreate()
>
> while True:
>     for f in os.listdir(target_dir):
>         df = spark.read.load(target_dir + f, format="csv")
>         print("Number of records: {0}".format(df.count()))
>         time.sleep(15)
> {code}
>
> *TESTED CASES WITH THE SAME BEHAVIOR:*
> * Default settings (spark-defaults.conf)
> * Setting {{spark.cleaner.periodicGC.interval}} to 1min (or less)
> * Setting {{spark.cleaner.referenceTracking.blocking}}=false
> * Running the application in cluster mode
> * Increasing/decreasing the resources of the executors and the driver
> * Setting extraJavaOptions on the driver and executors: -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12
> * Spark 2.4.4 (latest), which shows the same behavior
>
> *DEPENDENCIES*
> * Operating system: Ubuntu 16.04.3 LTS
> * Java: jdk1.8.0_131 (tested also with jdk1.8.0_221)
> * Python: Python 2.7.12
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org