spark-user mailing list archives

From Zhiliang Zhu <>
Subject Re: the spark job is so slow - almost frozen
Date Thu, 21 Jul 2016 04:02:22 GMT
Thanks a lot for your kind help.  

    On Wednesday, July 20, 2016 11:35 AM, Andrew Ehrlich <> wrote:

- Filter down the data as early as possible in the job, dropping columns you don't need.
- Process fewer partitions of the Hive tables at a time.
- Cache frequently accessed data, for example dimension tables, lookup tables, or other datasets that are repeatedly accessed.
- Use the Spark UI to identify the bottlenecked resource.
- Remove features or columns from the output data until the job runs, then add them back in one at a time.
- Create a static dataset small enough to work with, then edit the query and retest, repeatedly, until you cut the execution time by a significant fraction.
- Use the Spark UI or the Spark shell to check for skew and make sure partitions are evenly distributed.
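The skew check in the last point can be sketched outside Spark as well: given per-partition record counts (for example, copied from a stage page in the Spark UI), flag partitions far above the median. The counts and the 2x threshold below are hypothetical, a minimal sketch rather than any official tooling:

```python
from statistics import median

def find_skewed_partitions(counts, ratio=2.0):
    """Return (index, count) pairs for partitions whose record count
    exceeds `ratio` times the median partition size."""
    mid = median(counts)
    return [(i, c) for i, c in enumerate(counts) if c > ratio * mid]

# Hypothetical per-partition record counts, as read from the Spark UI:
counts = [10_000, 9_800, 10_200, 95_000, 10_100]
print(find_skewed_partitions(counts))  # → [(3, 95000)]
```

If one partition dominates like this, repartitioning on a better-distributed key usually helps more than adding executors.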

On Jul 18, 2016, at 3:33 AM, Zhiliang Zhu <> wrote:
Thanks a lot for your reply .
In fact, we have tried to run the SQL on Kettle, Hive, and Spark (via HiveContext) respectively, and in each case the job seems to freeze before finishing.
The job needs to read different columns from each of the 6 tables for specific information, then do some simple calculation before producing the output. Join operations are used most heavily in the SQL.
Best wishes! 


    On Monday, July 18, 2016 6:24 PM, Chanh Le <> wrote:

 Hi,
What about the network bandwidth between Hive and Spark? Did the query run acceptably in Hive before you moved it to Spark? Because the query is complex, you can use something like the EXPLAIN command to show what is going on.
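In Spark SQL the statement is `EXPLAIN EXTENDED <query>`, submitted like any other query through HiveContext. As a minimal, runnable sketch of reading a query plan before executing a join, here is the analogous command in sqlite3 (the tables and columns are made up for illustration):

```python
import sqlite3

# Sketch: inspect a join's query plan before running it.
# In Spark/Hive the analogous statement is `EXPLAIN EXTENDED <query>`.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (id INTEGER, name TEXT);
""")
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT o.id, c.name
    FROM orders o JOIN customers c ON o.customer_id = c.id
""").fetchall()
for row in plan:
    print(row)  # each row describes one step, e.g. a full table scan
```

A plan full of unindexed full scans on the join side (or, in Spark, large shuffles) is usually where the time goes.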

On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu <> wrote:
the SQL logic in the program is quite complex, so I will not describe the detailed code here.

    On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu <> wrote:

 Hi All,
We have an application that needs to extract different columns from 6 Hive tables and then do some simple calculation; there are around 100,000 rows in each table. Finally it needs to output another table or file (with a consistent set of columns).
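For a job of that shape, one common cause of slowness is carrying full rows through every join. A minimal sketch of pruning columns before joining, with sqlite3 tables standing in for the Hive tables (all table and column names here are made up):

```python
import sqlite3

# Sketch: select only the needed columns *before* joining, so the join
# (and, in Spark, the shuffle) moves as little data as possible.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1 (id INTEGER, v1 REAL, unused_a TEXT, unused_b TEXT);
    CREATE TABLE t2 (id INTEGER, v2 REAL, unused_c TEXT);
    INSERT INTO t1 VALUES (1, 10.0, 'x', 'y'), (2, 20.0, 'x', 'y');
    INSERT INTO t2 VALUES (1, 1.5, 'z'), (2, 2.5, 'z');
""")
rows = con.execute("""
    SELECT a.id, a.v1 + b.v2 AS total
    FROM (SELECT id, v1 FROM t1) a
    JOIN (SELECT id, v2 FROM t2) b ON a.id = b.id
    ORDER BY a.id
""").fetchall()
print(rows)  # → [(1, 11.5), (2, 22.5)]
```

Spark's optimizer can often prune columns on its own, but writing the pruning explicitly keeps the query cheap across all three engines being compared here.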
 However, after many days of trying, the Spark Hive job is unthinkably slow - sometimes almost frozen. The Spark cluster has 5 nodes. Could anyone offer some help? Any idea or clue would also be appreciated.
Thanks in advance~


