spark-user mailing list archives

From Holden Karau <>
Subject Re: spark single PROCESS_LOCAL task
Date Tue, 19 Jul 2016 17:06:23 GMT
So it's possible that you have a lot of data in one of the partitions which
is local to that process; maybe you could cache & count the upstream RDD
and see what the input partitions look like? On the other hand, using
groupByKey is often a bad sign to begin with - can you rewrite your code to
avoid it?
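In a real job you'd check the skew with Spark's own API (e.g. `rdd.countByKey()` or `rdd.keys().sample(...)`) and then swap `groupByKey` for `reduceByKey`, which combines values per partition before the shuffle. As a hedged, plain-Python sketch of both ideas (the sample data and the helper names `group_by_key` / `reduce_by_key` are illustrative stand-ins, not Spark API):

```python
from collections import Counter

# Hypothetical sample of keys, as might be drawn with rdd.keys().sample(...).
# One hot key dominates, mirroring the skew described in the thread.
sampled_keys = ["a"] * 90 + ["b", "c", "d"] * 3 + ["e"]

# Step 1: count keys to confirm the skew (in Spark: rdd.countByKey()).
key_counts = Counter(sampled_keys)
hot_key, hot_count = key_counts.most_common(1)[0]

# Step 2: contrast the two aggregation styles.
# groupByKey materializes every value for a key inside one task,
# so a hot key concentrates all its values in a single task's memory:
def group_by_key(pairs):
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)   # all values held at once
    return groups

# reduceByKey combines values within each partition first (map-side
# combine), so only one running total per key crosses the shuffle:
def reduce_by_key(partitions, combine):
    merged = {}
    for part in partitions:
        local = {}
        for k, v in part:                    # combine inside the partition
            local[k] = combine(local[k], v) if k in local else v
        for k, v in local.items():           # merge partial results
            merged[k] = combine(merged[k], v) if k in merged else v
    return merged
```

If the aggregation really needs all values per key (so `reduceByKey` doesn't apply), salting the hot key across several sub-keys and re-merging afterwards is another common way to spread that one task's input.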

On Fri, Jul 15, 2016 at 10:57 AM, Matt K <> wrote:

> Hi all,
> I'm seeing some curious behavior which I have a hard time interpreting. I
> have a job which does a "groupByKey" and results in 300 executors. 299 are
> run in NODE_LOCAL mode. 1 executor is run in PROCESS_LOCAL mode.
> The 1 executor that runs in PROCESS_LOCAL mode gets about 10x as much
> input as the other executors. It dies with OOM, and the job fails.
> The only working theory I have is that there's a single key which has a
> ton of data tied to it. Even so, I can't explain why it's run in
> PROCESS_LOCAL mode and the others aren't.
> Anyone have ideas?
> Thanks,
> -Matt

