spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yong Zhang <java8...@hotmail.com>
Subject Re: Long-running job OOMs driver process
Date Fri, 18 Nov 2016 15:30:04 GMT
Just wondering, is it possible the memory usage keeping going up due to the web UI content?


Yong


________________________________
From: Alexis Seigneurin <aseigneurin@ipponusa.com>
Sent: Friday, November 18, 2016 10:17 AM
To: Nathan Lande
Cc: Keith Bourgoin; Irina Truong; user@spark.incubator.apache.org
Subject: Re: Long-running job OOMs driver process

+1 for using S3A.

It would also depend on what format you're using. I agree with Steve that Parquet, for instance,
is a good option. If you're using plain text files, some people use GZ files but they cannot
be partitioned, thus putting a lot of pressure on the driver. It doesn't look like this is
the issue you're running into, though, because it would not be a progressive slow down, but
please provide as much detail as possible about your app.

The cache could be an issue but the OOM would come from an executor, not from the driver.

>From what you're saying, Keith, it indeed looks like some memory is not being freed. Seeing
the code would help. If you can, also send all the logs (with Spark at least in INFO level).

Alexis

On Fri, Nov 18, 2016 at 10:08 AM, Nathan Lande <nathanlande@gmail.com<mailto:nathanlande@gmail.com>>
wrote:

+1 to not threading.

What does your load look like? If you are loading many files and cacheing them in N rdds rather
than 1 rdd this could be an issue.

If the above two things don't fix your oom issue, without knowing anything else about your
job, I would focus on your cacheing strategy as a potential culprit. Try running without any
cacheing to isolate the issue; bad cacheing strategy is the source of oom issues for me most
of the time.

On Nov 18, 2016 6:31 AM, "Keith Bourgoin" <keith@parsely.com<mailto:keith@parsely.com>>
wrote:
Hi Alexis,

Thanks for the response. I've been working with Irina on trying to sort this issue out.

We thread the file processing to amortize the cost of things like getting files from S3. It's
a pattern we've seen recommended in many places, but I don't have any of those links handy.
 The problem isn't the threading, per se, but clearly some sort of memory leak in the driver
itself.  Each file is a self-contained unit of work, so once it's done all memory related
to it should be freed. Nothing in the script itself grows over time, so if it can do 10 concurrently,
it should be able to run like that forever.

I've hit this same issue working on another Spark app which wasn't threaded, but produced
tens of thousands of jobs. Eventually, the Spark UI would get slow, then unresponsive, and
then be killed due to OOM.

I'll try to cook up some examples of this today, threaded and not. We were hoping that someone
had seen this before and it rung a bell. Maybe there's a setting to clean up info from old
jobs that we can adjust.

Cheers,

Keith.

On Thu, Nov 17, 2016 at 9:50 PM Alexis Seigneurin <aseigneurin@ipponusa.com<mailto:aseigneurin@ipponusa.com>>
wrote:
Hi Irina,

I would question the use of multiple threads in your application. Since Spark is going to
run the processing of each DataFrame on all the cores of your cluster, the processes will
be competing for resources. In fact, they would not only compete for CPU cores but also for
memory.

Spark is designed to run your processes in a sequence, and each process will be run in a distributed
manner (multiple threads on multiple instances). I would suggest to follow this principle.

Feel free to share to code if you can. It's always helpful so that we can give better advice.

Alexis

On Thu, Nov 17, 2016 at 8:51 PM, Irina Truong <irina@parsely.com<mailto:irina@parsely.com>>
wrote:

We have an application that reads text files, converts them to dataframes, and saves them
in Parquet format. The application runs fine when processing a few files, but we have several
thousand produced every day. When running the job for all files, we have spark-submit killed
on OOM:


#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 27226"...


The job is written in Python. We're running it in Amazon EMR 5.0 (Spark 2.0.0) with spark-submit.
We're using a cluster with a master c3.2xlarge instance (8 cores and 15g of RAM) and 3 core
c3.4xlarge instances (16 cores and 30g of RAM each). Spark config settings are as follows:


('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),

('spark.executors.instances', '3'),

('spark.yarn.executor.memoryOverhead', '9g'),

('spark.executor.cores', '15'),

('spark.executor.memory', '12g'),

('spark.scheduler.mode', 'FIFO'),

('spark.cleaner.ttl', '1800'),


The job processes each file in a thread, and we have 10 threads running concurrently. The
process will OOM after about 4 hours, at which point Spark has processed over 20,000 jobs.

It seems like the driver is running out of memory, but each individual job is quite small.
Are there any known memory leaks for long-running Spark applications on Yarn?



--
Alexis Seigneurin
Managing Consultant
(202) 459-1591<tel:202%20459.1591> - LinkedIn<http://www.linkedin.com/in/alexisseigneurin>

[X]<http://ipponusa.com/>
Rate our service<https://www.recommendi.com/app/survey/Eh2ZnWUPTxY>



--
Alexis Seigneurin
Managing Consultant
(202) 459-1591<tel:202%20459.1591> - LinkedIn<http://www.linkedin.com/in/alexisseigneurin>

[cid:E06DEFDD-9FA0-472D-8830-D7C7B1FEF8E9@home]<http://ipponusa.com/>
Rate our service<https://www.recommendi.com/app/survey/Eh2ZnWUPTxY>

Mime
View raw message