spark-user mailing list archives

From Holden Karau <hol...@pigscanfly.ca>
Subject Re: spark single PROCESS_LOCAL task
Date Tue, 19 Jul 2016 17:06:23 GMT
So it's possible that you have a lot of data in one of the partitions which
is local to that process; maybe you could cache & count the upstream RDD
and see what the input partitions look like? On the other hand, using
groupByKey is often a bad sign to begin with - can you rewrite your code to
avoid it?
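To illustrate the groupByKey point: a minimal, hypothetical sketch in plain Python (standing in for Spark's per-partition behavior, with made-up data) of why a reduceByKey-style aggregation moves far less data across the shuffle than groupByKey when one key is skewed. With groupByKey, every value for the hot key is shuffled to and buffered on a single task, which is exactly the kind of thing that OOMs; with reduceByKey, values are combined inside each partition first, so only one partial result per key per partition crosses the shuffle.

```python
from collections import defaultdict

# Hypothetical skewed dataset: two partitions, one "hot" key
# carrying almost all of the records.
partitions = [
    [("hot", 1)] * 1000 + [("cold", 1)] * 10,
    [("hot", 1)] * 1000 + [("cold", 1)] * 10,
]

def group_by_key(parts):
    # groupByKey-style: every individual value crosses the shuffle
    # and is buffered on the one task that owns its key.
    grouped = defaultdict(list)
    for part in parts:
        for k, v in part:
            grouped[k].append(v)
    shuffled = sum(len(vs) for vs in grouped.values())  # records moved
    return dict(grouped), shuffled

def reduce_by_key(parts, op):
    # reduceByKey-style: map-side combine within each partition first,
    # so only one partial value per key per partition is shuffled.
    partials = []
    for part in parts:
        local = {}
        for k, v in part:
            local[k] = op(local[k], v) if k in local else v
        partials.append(local)
    shuffled = sum(len(p) for p in partials)  # partials moved
    merged = {}
    for p in partials:
        for k, v in p.items():
            merged[k] = op(merged[k], v) if k in merged else v
    return merged, shuffled

grouped, g_moved = group_by_key(partitions)
summed, r_moved = reduce_by_key(partitions, lambda a, b: a + b)
# g_moved is 2020 (every record), r_moved is 4 (one partial per
# key per partition) -- same answer, tiny fraction of the shuffle.
```

This only applies if the downstream logic is really a reduction (sum, count, max, ...); if you genuinely need all values for a key in one place, groupByKey is unavoidable and the skewed key has to be handled some other way (e.g. salting the key).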

On Fri, Jul 15, 2016 at 10:57 AM, Matt K <matvey1414@gmail.com> wrote:

> Hi all,
>
> I'm seeing some curious behavior which I have a hard time interpreting. I
> have a job which does a "groupByKey" and results in 300 executors. 299 are
> run in NODE_LOCAL mode. 1 executor is run in PROCESS_LOCAL mode.
>
> The 1 executor that runs in PROCESS_LOCAL mode gets about 10x as much
> input as the other executors. It dies with OOM, and the job fails.
>
> The only working theory I have is that there's a single key which has a ton
> of data tied to it. Even so, I can't explain why that one task runs in
> PROCESS_LOCAL mode and the others don't.
>
> Does anyone have ideas?
>
> Thanks,
> -Matt
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
