spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: the spark job is so slow - almost frozen
Date Fri, 22 Jul 2016 02:07:34 GMT
Andrew,

You have pretty much summed up my entire experience. Please give a
presentation at a meetup on this and send across the links :)


Regards,
Gourav

On Wed, Jul 20, 2016 at 4:35 AM, Andrew Ehrlich <andrew@aehrlich.com> wrote:

> Try:
>
> - filtering down the data as early as possible in the job, dropping columns
> you don’t need
> - processing fewer partitions of the Hive tables at a time
> - caching frequently accessed data, for example dimension tables, lookup
> tables, or other datasets that are repeatedly accessed
> - using the Spark UI to identify the bottlenecked resource
> - removing features or columns from the output data until it runs, then
> adding them back in one at a time
> - creating a static dataset small enough to work with, then editing the
> query and retesting repeatedly until you cut the execution time by a
> significant fraction
> - using the Spark UI or spark shell to check for skew and make sure
> partitions are evenly distributed
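
As a rough illustration of the last point, the skew check can be reduced to comparing the largest partition against the average partition size. The sketch below is plain Python and is not taken from the thread: the helper name `partition_skew` and the sample row counts are hypothetical, and it assumes the per-partition row counts have already been collected (in PySpark, one way to do that is `df.rdd.glom().map(len).collect()`).

```python
from statistics import mean


def partition_skew(row_counts):
    """Return the ratio of the largest partition to the mean partition size.

    A value near 1.0 suggests evenly distributed partitions; a much larger
    value suggests that one partition (and so one task) carries most rows.
    """
    if not row_counts:
        raise ValueError("row_counts must not be empty")
    avg = mean(row_counts)
    if avg == 0:
        return 0.0  # every partition is empty, so there is nothing to skew
    return max(row_counts) / avg


# Hypothetical per-partition row counts (assumed, not from the thread).
# In PySpark they could be gathered with:
#   df.rdd.glom().map(len).collect()
# glom() builds each partition as an in-memory list on the executors, so this
# is only reasonable for small tables such as the ~100,000-row ones here.
balanced = [25_000, 25_000, 25_000, 25_000]
skewed = [97_000, 1_000, 1_000, 1_000]

print(partition_skew(balanced))  # 1.0
print(partition_skew(skewed))    # 3.88
```

A ratio well above 1 after a join usually points at a hot join key; what threshold counts as "too skewed" is a judgment call that depends on the job.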
>
> On Jul 18, 2016, at 3:33 AM, Zhiliang Zhu <zchl.jump@yahoo.com.INVALID
> <zchl.jump@yahoo.com.invalid>> wrote:
>
> Thanks a lot for your reply.
>
> In fact, we tried running the SQL on Kettle, Hive, and Spark Hive
> (via HiveContext) respectively; the job seems to freeze and does not finish.
>
> Across the 6 tables, we need to read different columns from each table
> for specific information, then do some simple calculations
> before output.
> Join operations are used the most in the SQL.
>
> Best wishes!
>
>
>
>
> On Monday, July 18, 2016 6:24 PM, Chanh Le <giaosudau@gmail.com> wrote:
>
>
> Hi,
> What about the network (bandwidth) between Hive and Spark?
> Did it run in Hive before you moved to Spark?
> Because it's complex, you can use something like the EXPLAIN command to show
> what is going on.
>
>
>
>
>
>
> On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu <zchl.jump@yahoo.com.INVALID
> <zchl.jump@yahoo.com.invalid>> wrote:
>
> The SQL logic in the program is very complex, so I will not describe the
> detailed code here.
>
>
> On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu <
> zchl.jump@yahoo.com.INVALID <zchl.jump@yahoo.com.invalid>> wrote:
>
>
> Hi All,
>
> We have an application that needs to extract different columns from 6
> Hive tables and then do some simple calculations. There are around 100,000
> rows in each table.
> Finally, it needs to output another table or file (with a consistent
> column format).
>
> However, after many days of trying, the Spark Hive job is unbelievably
> slow, sometimes almost frozen. There are 5 nodes in the Spark cluster.
>
> Could anyone offer some help? Any idea or clue would also be welcome.
>
> Thanks in advance~
>
> Zhiliang
>
>
>
>
>
>
>
