spark-user mailing list archives

From Zhiliang Zhu <zchl.j...@yahoo.com.INVALID>
Subject Re: spark job automatically killed without rhyme or reason
Date Mon, 20 Jun 2016 08:17:06 GMT
Hi Alexander,
Thanks a lot for your comments.
Spark does not seem that stable when it comes to running a big job, with too much data or too long a running time; yes, the problem goes away when the scale is reduced. Sometimes resetting a job parameter helps (for instance, --driver-memory may help with a GC issue), and sometimes the code has to be rewritten with a different algorithm.
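For example, a resubmission with the driver heap raised and GC logging switched on might look roughly like this (the class, jar and values here are placeholders for illustration, not the actual job):

# illustrative only: bigger driver heap, plus GC logs to confirm whether GC is the problem
spark-submit \
  --master yarn-client \
  --driver-memory 20g \
  --driver-java-options "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --class com.example.Main \
  my-job.jar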
As you commented on the shuffle operation, that does sound like the reason ...
Best Wishes!
 

    On Friday, June 17, 2016 8:45 PM, Alexander Kapustin <kpavn@hotmail.com> wrote:
 

Hi Zhiliang,

Yes, finding the exact reason for a failure is very difficult. We had an issue with similar behavior; due to limited time for investigation, we reduced the amount of processed data and the problem went away.

Some points which may help you in your investigation:
· If you start the spark-history-server (or monitor the running application on port 4040), look into the failed stages (if any). By default Spark retries stage execution 2 times; after that the job fails.
· Some useful information may be contained in the YARN logs on the Hadoop nodes (yarn-<user>-nodemanager-<host>.log), but this is only information about the killed container, not about the reasons why the stage took so much memory (a sketch of how to collect these logs follows below).

As I can see in your logs, the failed step relates to a shuffle operation; could you change your job to avoid the massive shuffle operation?

--
WBR, Alexander
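A rough sketch of how those logs can be collected (the application id below is just a placeholder, and the NodeManager log path differs between clusters and distributions):

# aggregated container logs (executor stdout/stderr) for a finished or killed application
yarn logs -applicationId application_1466150000000_0001 > app_logs.txt

# on each Hadoop node, the NodeManager log records killed containers,
# e.g. when a container runs beyond its physical memory limit
grep -iE "killing container|beyond physical memory" \
    /var/log/hadoop-yarn/yarn-*-nodemanager-*.log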
From: Zhiliang Zhu
Sent: June 17, 2016 14:10
To: User; kpavn@hotmail.com
Subject: Re: spark job automatically killed without rhyme or reason   

Hi Alexander,
Is your YARN userlog just the executor log?
Those logs seem a little difficult to use for deciding exactly where things went wrong, because a successful job may also show some of these errors ... but then repairs itself. Spark does not seem that stable currently ...
Thank you in advance~

On Friday, June 17, 2016 6:53 PM, Zhiliang Zhu <zchl.jump@yahoo.com> wrote:


Hi Alexander,
Thanks a lot for your reply.
Yes, the job is submitted via YARN. Do you just mean the executor log file obtained by way of yarn logs -applicationId id?
In this file, in both some containers' stdout and stderr, I see:
16/06/17 14:05:40 INFO client.TransportClientFactory: Found inactive connection to ip-172-31-20-104/172.31.20.104:49991, creating a new one.
16/06/17 14:05:40 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to ip-172-31-20-104/172.31.20.104:49991
             <------ may it be that Spark is not that stable, and repairs itself after these kinds of errors? (I saw some of these in a successful run as well)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
        ............
Caused by: java.net.ConnectException: Connection refused: ip-172-31-20-104/172.31.20.104:49991
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
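If these fetch failures are only transient, it is Spark's own shuffle retry settings that let it recover; they can be made more tolerant at submit time, for example (values, class and jar here are illustrative placeholders only):

# illustrative only: retry shuffle block fetches more times and wait longer between attempts
spark-submit \
  --master yarn-client \
  --conf spark.shuffle.io.maxRetries=10 \
  --conf spark.shuffle.io.retryWait=15s \
  --class com.example.Main \
  my-job.jar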

16/06/17 11:54:38 ERROR executor.Executor: Managed memory leak detected; size = 16777216 bytes, TID = 100323
             <----- would it be a memory leak issue? Though no GC exception was thrown, as would be seen with other normal kinds of out-of-memory problems.
16/06/17 11:54:38 ERROR executor.Executor: Exception in task 145.0 in stage 112.0 (TID 100323)
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:837)
        at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:679)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)
        ...........

Sorry, the above is information from the middle of the log file; everything is okay at the end part of the log. It is from the run log file log_file, generated by the command:

nohup spark-submit --driver-memory 20g --num-executors 20 \
  --class com.dianrong.Main --master yarn-client \
  dianrong-retention_2.10-1.0.jar doAnalysisExtremeLender \
  /tmp/drretention/test/output 0.96 \
  /tmp/drretention/evaluation/test_karthik/lgmodel \
  /tmp/drretention/input/feature_6.0_20151001_20160531_behavior_201511_201604_summary/lenderId_feature_live \
  50 > log_file

executor 40 lost
             <------ would it be due to this? Sometimes the job may fail for this reason.
..........
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)
        ..........

Thanks in advance!




On Friday, June 17, 2016 3:52 PM, Alexander Kapustin <kpavn@hotmail.com> wrote:


#yiv4291334619x_yiv7679307012 {margin:2.0cm 42.5pt 2.0cm 3.0cm;}#yiv4291334619 Hi, Did you
submit spark job via YARN? In some cases (memory configuration probably), yarn can kill containers
where spark tasks are executed. In this situation, please check yarn userlogs for more information… --WBR,
Alexander From: Zhiliang Zhu
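If YARN is indeed killing the containers for exceeding their memory allocation, the usual knobs are the executor heap and the YARN memory overhead, roughly along these lines (the class, jar and values below are placeholders for illustration only, not a recommendation for this particular job):

# illustrative: larger executor heap plus extra off-heap headroom (in MB),
# so the container stays within the limit YARN enforces
spark-submit \
  --master yarn-client \
  --executor-memory 8g \
  --num-executors 20 \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --class com.example.Main \
  my-job.jar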
From: Zhiliang Zhu
Sent: June 17, 2016 9:36
To: Zhiliang Zhu;User
Subject: Re: spark job automatically killed without rhyme or reason

Has anyone ever met a similar problem? It is quite strange ...

On Friday, June 17, 2016 2:13 PM, Zhiliang Zhu <zchl.jump@yahoo.com.INVALID> wrote:


Hi All,
I have a big job which takes more than one hour to run in full; however, quite unreasonably, it exits and stops midway (almost 80% of the job actually finishes, but not all of it), without any apparent error or exception in the log.
I have submitted the same job many times, and it is always the same. In the last line of the run log there is just the single word "killed", or sometimes there is no error log at all; everything looks okay, but the job should not have ended there.
What is the way to deal with this problem? Have any other friends ever met a similar issue ...
Thanks in advance!









  