Hey Adam,

I'm not sure I understand just yet what you have in mind. My takeaway from the logs is that the container was indeed above its allotment of about 14G. Since 6G of that is reserved for overhead, I assumed there would be plenty of headroom for the Python workers, but there seem to be far more of them than I'd expect.

Does anyone know whether that is actually the intended behavior, i.e. in this case over 90 Python processes on a 2-core executor?

Best,
-Sven


On Fri, Jan 23, 2015 at 10:04 PM, Adam Diaz <adam.h.diaz@gmail.com> wrote:
YARN only has the ability to kill containers, not to checkpoint or suspend (SIGSTOP) them. If you use too much memory, it will simply kill tasks based on the YARN config.
https://issues.apache.org/jira/browse/YARN-2172
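
The relevant knobs on the YARN side are the NodeManager memory checks (standard Hadoop 2.x property names, listed here from memory, so double-check them against your yarn-site.xml):

  yarn.nodemanager.pmem-check-enabled   (physical memory check, default true)
  yarn.nodemanager.vmem-check-enabled   (virtual memory check, default true)
  yarn.nodemanager.vmem-pmem-ratio      (virtual-to-physical ratio, default 2.1)

The "running beyond physical memory limits" message and exit status 143 in your log are what the physical memory check produces when it kills a container.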


On Friday, January 23, 2015, Sandy Ryza <sandy.ryza@cloudera.com> wrote:
Hi Sven,

What version of Spark are you running?  Recent versions have a change that allows PySpark to share a pool of processes instead of starting a new one for each task.
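
If memory serves, the setting that controls this is spark.python.worker.reuse (I believe it defaults to true in releases that have it); you can also set it explicitly when submitting the job, e.g.:

  spark-submit --conf spark.python.worker.reuse=true ...

If your version predates that change, a new Python worker is forked for each task, which could explain the process counts you're seeing.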

-Sandy

On Fri, Jan 23, 2015 at 9:36 AM, Sven Krasser <krasser@gmail.com> wrote:
Hey all,

I am running into a problem where YARN kills containers for being over their memory allocation (about 8G for the executor plus 6G of overhead), and I noticed that in those containers there are tons of pyspark.daemon processes hogging memory. Here's a snippet from a container with 97 pyspark.daemon processes; the total RSS usage across all of them is 1,764,956 pages (i.e. about 6.7GB on this system).

Any ideas what's happening here and how I can get the number of pyspark.daemon processes back to a more reasonable count?
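
(For reference, the sizing above corresponds roughly to the following submit settings; I'm using the standard flag names for Spark on YARN here rather than pasting my exact job config:

  spark-submit \
    --master yarn \
    --executor-memory 8g \
    --conf spark.yarn.executor.memoryOverhead=6144 \
    ...

which is in the ballpark of the 14.5 GB container limit YARN reports below.)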

2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59)) - Container marked as failed: container_1421692415636_0052_01_000030. Exit status: 143. Diagnostics: Container [pid=35211,containerID=container_1421692415636_0052_01_000030] is running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1421692415636_0052_01_000030 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon
|- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon
|- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon
	[...]

Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c

Thank you!
-Sven




--
http://sites.google.com/site/krasser/?utm_source=sig