spark-user mailing list archives

From java8964 <java8...@hotmail.com>
Subject RE: Re: Re: RE: Re: Re: sparksql running slow while joining 2 tables.
Date Thu, 07 May 2015 00:27:52 GMT
It looks like you have data in these 24 partitions, or more. How many unique names are in your data set?
Enlarging the shuffle partitions only makes sense if you have large partition groups in your data. What you describe suggests that either your dataset only has data in these 24 partitions, or the data in these 24 partitions is skewed.
If you are really joining 56MB of data with 26MB of data, I am surprised that 24 partitions run very slowly under an 8G executor.
Yong
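One way to answer that question is to count rows per join key; if a couple of dozen names dominate, the reducers that receive them will be the slow ones. A minimal sketch against your tables (assuming you run it from the Spark SQL shell; column and table names taken from your query):

```sql
-- How many distinct join keys exist in total?
SELECT COUNT(DISTINCT name) FROM db;

-- How skewed are they? The heaviest keys pin down the hot partitions.
SELECT name, COUNT(*) AS cnt
FROM db
GROUP BY name
ORDER BY cnt DESC
LIMIT 20;
```

If the first query returns a number near 24, that would explain why only 24 tasks ever receive data, no matter how many shuffle partitions you configure.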
Date: Wed, 6 May 2015 14:04:11 +0800
From: luohui20001@sina.com
To: luohui20001@sina.com; hao.cheng@intel.com; daoyuan.wang@intel.com; ssaboum@gmail.com;
user@spark.apache.org
Subject: Re: Re: RE: Re: Re: sparksql running slow while joining 2 tables.

Update after some tests: I modified some other parameters and found two that may be relevant, SPARK_WORKER_INSTANCES and spark.sql.shuffle.partitions.

Before today I used the default settings of SPARK_WORKER_INSTANCES and spark.sql.shuffle.partitions, whose values are 1 and 200. At that time my app stopped at 5/200 tasks.

Then I changed SPARK_WORKER_INSTANCES to 2, and my app moved on to about 116/200 tasks; with 4 instances I got further, to 176/200. However, when I changed to 8 or even more, like 12 workers, it is still stuck at 176/200.
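For reference, in standalone mode the worker count is set in conf/spark-env.sh on each node (a config sketch; the value mirrors one of the runs above, and the workers need a restart after editing):

```
# conf/spark-env.sh -- standalone mode; restart workers after changing
SPARK_WORKER_INSTANCES=4
```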


Later I noticed something new while trying different values of spark.sql.shuffle.partitions: with 50, 400, or 800 partitions the job stops at 26/50, 376/400, or 776/800 tasks, always leaving 24 tasks unable to finish.


Not sure why this happens. Hope this info helps to solve it.
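For anyone reproducing this, the shuffle partition count can be changed per session from the Spark SQL shell (a sketch; the value is taken from one of the runs above):

```sql
-- Change the number of shuffle partitions for this session only.
SET spark.sql.shuffle.partitions=400;
```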



--------------------------------
 
Thanks & Best regards!
罗辉 San.Luo

----- Original Message -----
From: <luohui20001@sina.com>
To: "Cheng, Hao" <hao.cheng@intel.com>, "Wang, Daoyuan" <daoyuan.wang@intel.com>, "Olivier Girardot" <ssaboum@gmail.com>, "user" <user@spark.apache.org>
Subject: Re: RE: Re: Re: sparksql running slow while joining 2 tables.
Date: 2015-05-06 09:51

db has 1.7 million records while sample has 0.6 million. For JVM settings I tried the defaults, and also tried to apply 4g by "export _java_opts 4g"; the app still stops running.
BTW, here is some detailed info about GC and the JVM.
----- Original Message -----
From: "Cheng, Hao" <hao.cheng@intel.com>
To: "luohui20001@sina.com" <luohui20001@sina.com>, "Wang, Daoyuan" <daoyuan.wang@intel.com>, Olivier Girardot <ssaboum@gmail.com>, user <user@spark.apache.org>
Subject: RE: Re: Re: sparksql running slow while joining 2 tables.
Date: 2015-05-05 20:50
56MB / 26MB is a very small size; do you observe data skew, i.e. many records with the same chrname / name? And can you also double-check the JVM settings
for the executor process?
 
 
From: luohui20001@sina.com [mailto:luohui20001@sina.com]
Sent: Tuesday, May 5, 2015 7:50 PM
To: Cheng, Hao; Wang, Daoyuan; Olivier Girardot; user
Subject: Re: Re: sparksql running slow while joining 2 tables.
 
Hi guys,
          attached are the pics of the physical plan and logs. Thanks.
--------------------------------
 
Thanks & Best regards!
罗辉 San.Luo
 
----- Original Message -----
From: "Cheng, Hao" <hao.cheng@intel.com>
To: "Wang, Daoyuan" <daoyuan.wang@intel.com>, "luohui20001@sina.com" <luohui20001@sina.com>, Olivier Girardot <ssaboum@gmail.com>, user <user@spark.apache.org>
Subject: Re: sparksql running slow while joining 2 tables.
Date: 2015-05-05 13:18
 
I assume you’re using the DataFrame API within your application.
 
sql("SELECT…").explain(true)
 
From: Wang, Daoyuan
Sent: Tuesday, May 5, 2015 10:16 AM
To: luohui20001@sina.com; Cheng, Hao; Olivier Girardot; user
Subject: RE: Re: RE: Re: Re: sparksql running slow while joining 2 tables.
 
You can use
EXPLAIN EXTENDED SELECT …
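Applied to the query from the original mail at the bottom of this thread, that would be (a sketch; table and column names come from that query):

```sql
-- Prints the parsed, analyzed, optimized, and physical plans.
EXPLAIN EXTENDED
SELECT a.name, a.startpoint, a.endpoint, a.piece
FROM db a
JOIN sample b ON (a.name = b.name)
WHERE (b.startpoint > a.startpoint + 25);
```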
 
From: luohui20001@sina.com [mailto:luohui20001@sina.com]
Sent: Tuesday, May 05, 2015 9:52 AM
To: Cheng, Hao; Olivier Girardot; user
Subject: Re: RE: Re: Re: sparksql running slow while joining 2 tables.
 
As I know, broadcast join is enabled automatically when a table is smaller than spark.sql.autoBroadcastJoinThreshold.
refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
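Since the threshold is in bytes and defaults to 10MB in this Spark version, a 56MB table would not qualify unless the threshold is raised, e.g. (a sketch; 64MB chosen here only because it covers the 56MB table):

```sql
-- Raise the broadcast threshold to 64MB so the 56MB table qualifies
-- (the default is 10485760 bytes, i.e. 10MB).
SET spark.sql.autoBroadcastJoinThreshold=67108864;
```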
 
And how do I check my app's physical plan, and other things like the optimized plan, executable plan, etc.?
 
thanks
 
--------------------------------
 
Thanks & Best regards!
罗辉 San.Luo
 
----- Original Message -----
From: "Cheng, Hao" <hao.cheng@intel.com>
To: "Cheng, Hao" <hao.cheng@intel.com>, "luohui20001@sina.com" <luohui20001@sina.com>, Olivier Girardot <ssaboum@gmail.com>, user <user@spark.apache.org>
Subject: RE: Re: Re: sparksql running slow while joining 2 tables.
Date: 2015-05-05 08:38
 
Or, have you ever tried a broadcast join?
 
From: Cheng, Hao [mailto:hao.cheng@intel.com]
Sent: Tuesday, May 5, 2015 8:33 AM
To: luohui20001@sina.com; Olivier Girardot; user
Subject: RE: Re: Re: sparksql running slow while joining 2 tables.
 
Can you print out the physical plan?
 
EXPLAIN SELECT xxx…
 
From: luohui20001@sina.com [mailto:luohui20001@sina.com]
Sent: Monday, May 4, 2015 9:08 PM
To: Olivier Girardot; user
Subject: Re: Re: sparksql running slow while joining 2 tables.
 
hi Olivier
spark 1.3.1, with java 1.8.0_45
and 2 pics added.
It seems like a GC issue. I also tried different parameters like memory size of driver & executor, memory fraction, java opts...
but this issue still happens.
 
--------------------------------
 
Thanks & Best regards!
罗辉 San.Luo
 
----- Original Message -----
From: Olivier Girardot <ssaboum@gmail.com>
To: luohui20001@sina.com, user <user@spark.apache.org>
Subject: Re: sparksql running slow while joining 2 tables.
Date: 2015-05-04 20:46
 
Hi, 
What is your Spark version?
 
Regards, 
 
Olivier.
 
On Mon, May 4, 2015 at 11:03, <luohui20001@sina.com> wrote:
hi guys
        when I run a SQL query like "select a.name, a.startpoint, a.endpoint, a.piece from db a join sample b on (a.name = b.name) where (b.startpoint > a.startpoint + 25);", sparksql runs slowly for minutes, which may be caused by very long GC and shuffle times.
 
       table db is created from a txt file sized at 56MB, while table sample is 26MB; both are small.
       my spark cluster is a standalone pseudo-distributed cluster with 8g executor memory and 4g driver memory.
       any advice? thank you guys.
 
 
--------------------------------
 
Thanks & Best regards!
罗辉 San.Luo
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org