spark-issues mailing list archives

From "Mark Khaitman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-5395) Large number of Python workers causing resource depletion
Date Tue, 27 Jan 2015 01:04:34 GMT

    [ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292769#comment-14292769 ]

Mark Khaitman commented on SPARK-5395:
--------------------------------------

This may prove to be useful...

I'm watching the pyspark.daemon processes while a spark-submitted job is currently running.
The framework is permitted to use only 8 cores on each node, with the default Python worker
memory of 512 MB per node (not the executor memory, which is set higher than this).
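
For context, a minimal sketch of the kind of configuration being described (the specific values are illustrative assumptions, not the exact settings of this job):

{noformat}
from pyspark import SparkConf, SparkContext

# Illustrative values only: the job above is limited to 8 cores per node and
# the default spark.python.worker.memory of 512m; the executor memory is only
# described as "higher than this", so 8g is an assumption taken from the
# issue description below.
conf = (SparkConf()
        .setAppName("daemon-growth-example")                  # hypothetical name
        .set("spark.executor.cores", "8")
        .set("spark.executor.memory", "8g")
        .set("spark.python.worker.memory", "512m")            # per-worker spill threshold
        .set("spark.yarn.executor.memoryOverhead", "6144"))   # ~6G overhead, per the report

sc = SparkContext(conf=conf)
{noformat}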

Ignoring the exact RDD actions for a moment: as the job transitioned from Stage 1 to Stage 2,
it spawned 8-10 additional pyspark.daemon processes, making the box use more cores than it
was even allowed to... A few seconds after that, the other 8 processes entered a sleeping
state while still holding onto the physical memory they had consumed in Stage 1. As soon as
Stage 2 finished, practically all of the pyspark.daemon processes vanished and the memory was
freed. I was keeping an eye on 2 random nodes and exactly the same thing occurred on both. It
was also the only job executing at the time, so there was really no other
interference/contention for resources.
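
For anyone wanting to watch the same thing, here is a rough sketch of a per-node check (an assumption of how to do it, not the exact commands used here; it just counts processes whose command line contains pyspark.daemon):

{noformat}
import subprocess

# Count pyspark.daemon processes on the node and sum their resident memory.
# Assumes a Linux-style ps; the daemons are started as "python -m pyspark.daemon",
# so matching on that string in the command line should catch them.
out = subprocess.check_output(["ps", "-eo", "rss,args"]).decode("utf-8", "replace")
daemons = [line for line in out.splitlines() if "pyspark.daemon" in line]
rss_kb = sum(int(line.split(None, 1)[0]) for line in daemons)

print("pyspark.daemon processes: %d" % len(daemons))
print("total resident memory: %.1f MB" % (rss_kb / 1024.0))
{noformat}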

I will try to provide a bit more detail on the exact transformations/actions occurring between
the 2 stages, although I know a partitionBy and a cogroup are occurring at the very least,
without inspecting the spark-submitted code directly.
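
To make the shape of the job concrete, a minimal hypothetical pipeline with a partitionBy and a cogroup at the stage boundary (the names and data are made up; this is not the actual spark-submitted code):

{noformat}
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("partitionby-cogroup-sketch"))

# Hypothetical keyed RDDs standing in for the real inputs.
left = sc.parallelize([(i % 1000, i) for i in range(100000)])
right = sc.parallelize([(i % 1000, -i) for i in range(100000)])

# partitionBy forces a shuffle (roughly "Stage 1"); cogroup then brings both
# sides into aligned partitions ("Stage 2"), which is where the extra
# pyspark.daemon workers were observed spawning.
left = left.partitionBy(64)
grouped = left.cogroup(right)

print(grouped.mapValues(lambda v: (len(list(v[0])), len(list(v[1])))).take(5))
{noformat}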

> Large number of Python workers causing resource depletion
> ---------------------------------------------------------
>
>                 Key: SPARK-5395
>                 URL: https://issues.apache.org/jira/browse/SPARK-5395
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>         Environment: AWS ElasticMapReduce
>            Reporter: Sven Krasser
>
> During job execution a large number of Python workers accumulates, eventually causing YARN to kill containers for being over their memory allocation (in the case below that is about 8G for executors plus 6G for overhead per container).
> In this instance, 97 pyspark.daemon processes had accumulated by the time the container was killed (a rough back-of-the-envelope check of these numbers follows the quoted description below).
> {noformat}
> 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59)) - Container marked as failed: container_1421692415636_0052_01_000030. Exit status: 143. Diagnostics: Container [pid=35211,containerID=container_1421692415636_0052_01_000030] is running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing container.
> Dump of the process-tree for container_1421692415636_0052_01_000030 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon
> |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon
> |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon
> 	[...]
> {noformat}
> The configuration uses 64 containers with 2 cores each.
> Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c
> Mailinglist discussion: https://www.mail-archive.com/user@spark.apache.org/msg20102.html
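
A rough back-of-the-envelope check of the numbers quoted above, assuming a 4 KB page size for the RSSMEM_USAGE(PAGES) column (per-daemon RSS will obviously vary):

{noformat}
# Numbers taken from the container dump and description above; the 4 KB page
# size is an assumption about the nodes, not something stated in the report.
page_size_kb = 4
rss_pages_per_daemon = 16834      # typical pyspark.daemon row in the dump
daemon_count = 97                 # daemons accumulated at kill time

container_limit_gb = 14.5         # ~8G executor memory + ~6G overhead
daemon_rss_gb = daemon_count * rss_pages_per_daemon * page_size_kb / (1024.0 ** 2)

print("per-daemon RSS: ~%.0f MB" % (rss_pages_per_daemon * page_size_kb / 1024.0))
print("97 daemons: ~%.1f GB vs a %.1f GB container limit" % (daemon_rss_gb, container_limit_gb))
# ~66 MB per daemon and ~6.2 GB across 97 daemons roughly exhausts the 6G
# overhead budget on its own, consistent with the 14.9 GB of 14.5 GB kill.
{noformat}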



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
