spark-issues mailing list archives

From "Mark Khaitman (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-5395) Large number of Python workers causing resource depletion
Date Mon, 26 Jan 2015 23:57:35 GMT

    [ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292668#comment-14292668 ]

Mark Khaitman edited comment on SPARK-5395 at 1/26/15 11:57 PM:
----------------------------------------------------------------

[~skrasser], so far I have also only managed to reproduce this using production data. I'll try to
write a simple reproduction tomorrow, but it looks like a mix of two problems: Python worker
processes not being killed once they are no longer needed (causing a build-up), and individual
Python workers exceeding their allocated memory limit.

I think it *may* be related to a couple of specific shuffle operations such as groupByKey/cogroup,
though I still need to run some tests to confirm what's causing it.
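
Roughly, the simple version I have in mind would look like the sketch below (purely illustrative:
the data sizes and key counts are made up, and I haven't yet verified that it reproduces the
build-up):

{noformat}
from pyspark import SparkContext

sc = SparkContext(appName="pyspark-worker-buildup-sketch")

# Wide shuffle with many distinct keys: groupByKey materializes each key's
# values inside the Python worker, which is where I suspect the per-worker
# memory limit gets exceeded.
pairs = sc.parallelize(range(1000000), 128).map(lambda x: (x % 50000, x))
grouped = pairs.groupByKey()

# Force evaluation without collecting the grouped values to the driver.
print(grouped.mapValues(lambda vs: sum(1 for _ in vs)).count())
{noformat}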

I should also add that we haven't changed the spark.python.worker.reuse setting, so in our case
it should be using the default of true (workers are reused across tasks).
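
For completeness, here's how we could flip that setting explicitly for a test run (just a sketch;
we haven't actually tried this yet):

{noformat}
from pyspark import SparkConf, SparkContext

# Default is "true" (Python workers are reused across tasks). Setting it to
# "false" forks a fresh worker per task, which might help isolate whether
# worker reuse contributes to the build-up.
conf = SparkConf().set("spark.python.worker.reuse", "false")
sc = SparkContext(conf=conf)
{noformat}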



> Large number of Python workers causing resource depletion
> ---------------------------------------------------------
>
>                 Key: SPARK-5395
>                 URL: https://issues.apache.org/jira/browse/SPARK-5395
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>         Environment: AWS ElasticMapReduce
>            Reporter: Sven Krasser
>
> During job execution a large number of Python workers accumulates, eventually causing YARN
to kill containers for exceeding their memory allocation (in the case below, about 8G for
executors plus 6G of overhead per container).
> In this instance, 97 pyspark.daemon processes had accumulated by the time the container was
killed.
> {noformat}
> 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59))
- Container marked as failed: container_1421692415636_0052_01_000030. Exit status: 143. Diagnostics:
Container [pid=35211,containerID=container_1421692415636_0052_01_000030] is running beyond
physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of
72.5 GB virtual memory used. Killing container.
> Dump of the process-tree for container_1421692415636_0052_01_000030 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES)
RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon
> |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon
> |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon
> 	[...]
> {noformat}
> The configuration uses 64 containers with 2 cores each.
> Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c
> Mailinglist discussion: https://www.mail-archive.com/user@spark.apache.org/msg20102.html
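
For context, a container limit in that range would roughly correspond to a configuration along
these lines (a sketch with illustrative values only, not the reporter's actual settings):

{noformat}
from pyspark import SparkConf

# Illustrative sizing (not taken from the report or the gist): an 8g executor
# heap plus ~6.5g of off-heap overhead yields the ~14.5 GB YARN container
# limit seen in the log, with 64 executors of 2 cores each.
conf = (SparkConf()
        .set("spark.executor.memory", "8g")
        .set("spark.yarn.executor.memoryOverhead", "6656")  # in MB for Spark 1.2
        .set("spark.executor.cores", "2")
        .set("spark.executor.instances", "64"))
{noformat}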



