spark-user mailing list archives

From Alex Rovner
Subject Re: How to optimize group by query fired using hiveContext.sql?
Date Sat, 03 Oct 2015 12:57:07 GMT
This sounds like you need to increase YARN overhead settings with the
parameter. See for
more information on the setting.
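
As a rough illustration of why raising the overhead helps, the following Python sketch works through the container-size arithmetic. It rests on two assumptions that the message itself does not confirm: that the setting in question is `spark.yarn.executor.memoryOverhead` (the Spark 1.x name for the executor overhead on YARN), and that the default overhead is max(384 MiB, 10% of executor memory). The helper names are hypothetical and exist only for this example.

```python
# Assumption: on YARN, each executor container is sized as
# executor memory + overhead, and the overhead defaults to
# max(384 MiB, 10% of executor memory) unless set explicitly.
MIN_OVERHEAD_MB = 384
OVERHEAD_FRACTION = 0.10


def default_overhead_mb(executor_memory_mb):
    """Default off-heap overhead in MiB under the assumed rule."""
    return max(MIN_OVERHEAD_MB, int(executor_memory_mb * OVERHEAD_FRACTION))


def container_request_mb(executor_memory_mb, overhead_mb=None):
    """Memory requested from YARN per executor: heap plus overhead."""
    if overhead_mb is None:
        overhead_mb = default_overhead_mb(executor_memory_mb)
    return executor_memory_mb + overhead_mb


def submit_flags(executor_memory_mb, overhead_mb):
    """Build spark-submit arguments that set the overhead explicitly.

    The property name is an assumption (see the note above the block).
    """
    return [
        "--executor-memory", f"{executor_memory_mb}m",
        "--conf", f"spark.yarn.executor.memoryOverhead={overhead_mb}",
    ]
```

Under these assumptions, an 8 GiB executor gets only about 0.8 GiB of headroom by default. If shuffle buffers and other off-heap usage exceed that headroom, the process crosses the container's physical memory limit and YARN kills it. Passing a larger explicit overhead, for example via `submit_flags(8192, 2048)`, widens the container without changing the heap size.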

If that does not work for you, please provide the error messages and the
command line you are using to submit your jobs for further troubleshooting.

*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052


On Sat, Oct 3, 2015 at 6:19 AM, unk1102 wrote:

> Hi, I have a couple of Spark jobs that use a group by query fired
> from hiveContext.sql(). I know group by is evil, but in my use case I
> can't avoid it: I have around 7-8 fields on which I need to group.
> I am also using df1.except(df2), which seems to be a heavy operation and
> does lots of shuffling; please see my UI snapshot.
> I have tried almost every optimisation, including Spark 1.5, but nothing
> seems to work: my job fails or hangs because an executor reaches its
> physical memory limit and YARN kills it. I have around 1TB of data to
> process, and it is skewed. Please guide.
