spark-user mailing list archives

From Keith Bourgoin <ke...@parsely.com>
Subject Re: Long-running job OOMs driver process
Date Fri, 18 Nov 2016 15:36:35 GMT
Thanks for the input. I had read somewhere that s3:// was the way to go due
to some recent changes, but apparently that was outdated. I’m working on
creating some dummy data and a script to process it right now. I’ll post
here with code and logs when I can successfully reproduce the issue on
non-production data.

Yong, that's a good point about the web content. I had forgotten to mention
that when I first saw this a few months ago, on another project, I could
sometimes trigger the OOM by trying to view the web UI for the job. That's
another case I'll try to reproduce.

Thanks again!

Keith.

On Fri, Nov 18, 2016 at 10:30 AM Yong Zhang <java8964@hotmail.com> wrote:

> Just wondering, is it possible the memory usage keeps going up due to the
> web UI content?
>
>
> Yong
>
>
> ------------------------------
> From: Alexis Seigneurin <aseigneurin@ipponusa.com>
> Sent: Friday, November 18, 2016 10:17 AM
> To: Nathan Lande
> Cc: Keith Bourgoin; Irina Truong; user@spark.incubator.apache.org
> Subject: Re: Long-running job OOMs driver process
>
> +1 for using S3A.
>
> It would also depend on what format you're using. I agree with Steve that
> Parquet, for instance, is a good option. If you're using plain text files,
> some people use GZ files but they cannot be partitioned, thus putting a lot
> of pressure on the driver. It doesn't look like this is the issue you're
> running into, though, because it would not be a progressive slow down, but
> please provide as much detail as possible about your app.
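>
> As a quick illustration (the paths are made up, and this assumes a
> SparkSession named "spark" with its SparkContext "sc"):
>
>     # a gzipped text file is not splittable: one partition, one task
>     rdd = sc.textFile("s3a://some-bucket/logs/events.log.gz")
>     print(rdd.getNumPartitions())     # 1, however large the file is
>
>     # the same data as Parquet can be read in parallel
>     df = spark.read.parquet("s3a://some-bucket/logs/events.parquet")
>     print(df.rdd.getNumPartitions())  # roughly one per file / row group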
>
> The cache could be an issue but the OOM would come from an executor, not
> from the driver.
>
> From what you're saying, Keith, it indeed looks like some memory is not
> being freed. Seeing the code would help. If you can, also send all the logs
> (with Spark logging at least at INFO level).
>
> Alexis
>
> On Fri, Nov 18, 2016 at 10:08 AM, Nathan Lande <nathanlande@gmail.com>
> wrote:
>
> +1 to not threading.
>
> What does your load look like? If you are loading many files and caching
> them in N RDDs rather than one RDD, this could be an issue.
>
> If the above two things don't fix your OOM issue, without knowing anything
> else about your job, I would focus on your caching strategy as a potential
> culprit. Try running without any caching to isolate the issue; a bad
> caching strategy is the source of OOM issues for me most of the time.
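>
> Roughly what I mean, as a hypothetical example (assumes a SparkSession
> "spark" and a list of input paths):
>
>     # one cached dataset per file: thousands of cached datasets over time
>     frames = [spark.read.text(p).cache() for p in input_paths]
>
>     # versus a single dataset over all the files
>     df = spark.read.text(input_paths)  # text() accepts a list of paths
>     # df.cache()                       # only cache it if you actually reuse it
>     # df.unpersist()                   # and free it when you're done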
>
> On Nov 18, 2016 6:31 AM, "Keith Bourgoin" <keith@parsely.com> wrote:
>
> Hi Alexis,
>
> Thanks for the response. I've been working with Irina on trying to sort
> this issue out.
>
> We thread the file processing to amortize the cost of things like getting
> files from S3. It's a pattern we've seen recommended in many places, but I
> don't have any of those links handy.  The problem isn't the threading, per
> se, but clearly some sort of memory leak in the driver itself.  Each file
> is a self-contained unit of work, so once it's done all memory related to
> it should be freed. Nothing in the script itself grows over time, so if it
> can do 10 concurrently, it should be able to run like that forever.
>
> I've hit this same issue working on another Spark app which wasn't
> threaded, but produced tens of thousands of jobs. Eventually, the Spark UI
> would get slow, then unresponsive, and then the driver would be killed due
> to OOM.
>
> I'll try to cook up some examples of this today, threaded and not. We were
> hoping that someone had seen this before and it rang a bell. Maybe there's
> a setting to clean up info from old jobs that we can adjust.
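>
> Something like spark.ui.retainedJobs and spark.ui.retainedStages might be
> what we need; they cap how much job and stage history the driver keeps for
> the UI (both default to 1000). For example, when building the context
> (values picked arbitrarily):
>
>     from pyspark import SparkConf, SparkContext
>
>     conf = (SparkConf()
>             .set('spark.ui.retainedJobs', '100')
>             .set('spark.ui.retainedStages', '100'))
>     sc = SparkContext(conf=conf)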
>
> Cheers,
>
> Keith.
>
> On Thu, Nov 17, 2016 at 9:50 PM Alexis Seigneurin <
> aseigneurin@ipponusa.com> wrote:
>
> Hi Irina,
>
> I would question the use of multiple threads in your application. Since
> Spark is going to run the processing of each DataFrame on all the cores of
> your cluster, the processes will be competing for resources. In fact, they
> would not only compete for CPU cores but also for memory.
>
> Spark is designed to run your processes in a sequence, and each process
> will be run in a distributed manner (multiple threads on multiple
> instances). I would suggest following this principle.
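>
> In other words, a plain loop on the driver rather than a thread pool; each
> iteration is still distributed across the cluster. A rough sketch, where
> transform() and output_path() are placeholders for your own code:
>
>     for path in input_paths:
>         df = spark.read.text(path)
>         result = transform(df)                   # your processing goes here
>         result.write.parquet(output_path(path))  # placeholder helper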
>
> Feel free to share the code if you can. It's always helpful so that we can
> give better advice.
>
> Alexis
>
> On Thu, Nov 17, 2016 at 8:51 PM, Irina Truong <irina@parsely.com> wrote:
>
> We have an application that reads text files, converts them to dataframes,
> and saves them in Parquet format. The application runs fine when processing
> a few files, but we have several thousand produced every day. When running
> the job over all of the files, spark-submit gets killed with an OOM:
>
> #
> # java.lang.OutOfMemoryError: Java heap space
> # -XX:OnOutOfMemoryError="kill -9 %p"
> #   Executing /bin/sh -c "kill -9 27226"...
>
> The job is written in Python. We’re running it in Amazon EMR 5.0 (Spark
> 2.0.0) with spark-submit. We’re using a cluster with one c3.2xlarge master
> instance (8 cores and 15 GB of RAM) and three c3.4xlarge core instances
> (16 cores and 30 GB of RAM each). Spark config settings are as follows:
>
> ('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
> ('spark.executors.instances', '3'),
> ('spark.yarn.executor.memoryOverhead', '9g'),
> ('spark.executor.cores', '15'),
> ('spark.executor.memory', '12g'),
> ('spark.scheduler.mode', 'FIFO'),
> ('spark.cleaner.ttl', '1800'),
>
> The job processes each file in a thread, and we have 10 threads running
> concurrently. The process will OOM after about 4 hours, at which point
> Spark has processed over 20,000 jobs.
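>
> A stripped-down sketch of the pattern (simplified, not the actual code;
> output_path() stands in for a helper that builds the destination path):
>
>     from multiprocessing.dummy import Pool  # thread pool from the stdlib
>
>     def process(path):
>         df = spark.read.text(path)           # the real parsing is more involved
>         df.write.parquet(output_path(path))
>
>     pool = Pool(10)                          # 10 concurrent threads
>     pool.map(process, input_paths)
>     pool.close()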
>
> It seems like the driver is running out of memory, but each individual job
> is quite small. Are there any known memory leaks for long-running Spark
> applications on YARN?
>
>
> --
> Alexis Seigneurin
> Managing Consultant
> (202) 459-1591 - LinkedIn <http://www.linkedin.com/in/alexisseigneurin>
> <http://ipponusa.com/>
> Rate our service <https://www.recommendi.com/app/survey/Eh2ZnWUPTxY>
