spark-user mailing list archives

From Zhiliang Zhu <zchl.j...@yahoo.com.INVALID>
Subject Re: spark job automatically killed without rhyme or reason
Date Thu, 23 Jun 2016 06:21:52 GMT
Thanks a lot for all the comments and the useful information.
Yes, I have a lot of experience writing and running Spark jobs; they tend to become unstable
when they run on more data or for a longer time. Sometimes a job fails when I set some parameter
on the command line but works once I remove it and use the default setting. Sometimes it
is the opposite, and a parameter has to be set properly.
The Spark installed here is 1.5, and it was set up by someone else.

 

    On Wednesday, June 22, 2016 1:59 PM, Nirav Patel <npatel@xactlycorp.com> wrote:
 

 Spark is a memory hog and tends to kill itself when a job processes a bigger dataset. However,
Databricks claims that Spark > 1.6 has optimizations related to memory footprint as
well as processing. They are only available if you use DataFrames or Datasets. If you are
using RDDs, you have to do a lot of testing and tuning.
On Mon, Jun 20, 2016 at 1:34 AM, Sean Owen <sowen@cloudera.com> wrote:

I'm not sure that's the conclusion. It's not trivial to tune and
configure YARN and Spark to match your app's memory needs and profile,
but it's also just a matter of setting them properly. I'm not clear
that you've set the executor memory, for example, and in particular
spark.yarn.executor.memoryOverhead.

Everything else you mention is a symptom of YARN shutting down your
jobs because your memory settings don't match what your app does.
They're not problems per se, based on what you have provided.
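
As a rough illustration of the sizing described above (not something posted in the thread), the sketch below estimates how much memory YARN is asked for per executor. It assumes the defaults documented for Spark 1.5, where spark.yarn.executor.memoryOverhead falls back to 10% of the executor memory with a 384 MB minimum; other versions may differ. The function names are made up for this example.

```python
# Assumed defaults, taken from the Spark 1.5 documentation for
# spark.yarn.executor.memoryOverhead: 10% of executor memory, at least 384 MB.
OVERHEAD_FACTOR = 0.10
OVERHEAD_MIN_MB = 384


def default_overhead_mb(executor_memory_mb):
    """Overhead used when spark.yarn.executor.memoryOverhead is not set."""
    return max(int(executor_memory_mb * OVERHEAD_FACTOR), OVERHEAD_MIN_MB)


def container_request_mb(executor_memory_mb, overhead_mb=None):
    """Memory requested from YARN per executor: heap plus overhead."""
    if overhead_mb is None:
        overhead_mb = default_overhead_mb(executor_memory_mb)
    return executor_memory_mb + overhead_mb


def memory_flags(executor_memory_mb, overhead_mb):
    """spark-submit arguments that set both values explicitly (overhead in MB)."""
    return [
        "--executor-memory", f"{executor_memory_mb}m",
        "--conf", f"spark.yarn.executor.memoryOverhead={overhead_mb}",
    ]


if __name__ == "__main__":
    # Hypothetical numbers: a 4 GB executor heap with 1 GB of overhead.
    print(container_request_mb(4096, 1024))
    print(" ".join(memory_flags(4096, 1024)))
```

If the executor process grows past this total (heap plus off-heap usage such as shuffle buffers), YARN can kill the container, so raising the overhead is often the first thing to try when executors are lost without a Java exception.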


On Mon, Jun 20, 2016 at 9:17 AM, Zhiliang Zhu
<zchl.jump@yahoo.com.invalid> wrote:
> Hi Alexander,
>
> Thanks a lot for your comments.
>
> Spark does not seem that stable when running a big job, with too much data or
> too much time. Yes, the problem goes away when the scale is reduced.
> Sometimes resetting a job parameter helps (for example, --driver-memory may help
> with GC issues); sometimes the code has to be rewritten to use another algorithm.
>
> As you commented, the shuffle operation does sound like the reason ...
>
> Best wishes!
>
>
>
> On Friday, June 17, 2016 8:45 PM, Alexander Kapustin <kpavn@hotmail.com>
> wrote:
>
>
> Hi Zhiliang,
>
> Yes, finding the exact reason for the failure is very difficult. We had an issue with
> similar behavior; due to limited time for investigation, we reduced the
> amount of processed data, and the problem went away.
>
> Some points which may help your investigation:
> ·         If you start spark-history-server (or monitor the running
> application on port 4040), look at the failed stages (if any). By default,
> Spark retries stage execution 2 times; after that, the job fails
> ·         Some useful information may be contained in the YARN logs on the Hadoop nodes
> (yarn-<user>-nodemanager-<host>.log), but this is only information about the
> killed container, not about the reasons why the stage took so much memory
>
> As far as I can see in your logs, the failed step relates to a shuffle operation. Could
> you change your job to avoid massive shuffle operations?
>
> --
> WBR, Alexander
>
> From: Zhiliang Zhu
> Sent: June 17, 2016, 14:10
> To: User; kpavn@hotmail.com
> Subject: Re: spark job automatically killed without rhyme or reason
>
>
>
> Hi Alexander,
>
> Is your YARN userlog just the executor log?
>
> From those logs it seems a little difficult to pinpoint exactly what went wrong,
> because sometimes a successful job also shows some of these errors ...
> but recovers by itself.
> Spark does not seem that stable currently ...
>
> Thank you in advance~
>
>
>
> On Friday, June 17, 2016 6:53 PM, Zhiliang Zhu <zchl.jump@yahoo.com> wrote:
>
>
> Hi Alexander,
>
> Thanks a lot for your reply.
>
> Yes, it was submitted via YARN.
> Do you mean the executor log file obtained with yarn logs -applicationId
> id?
>
> In this file, in both stdout and stderr of some containers:
>
> 16/06/17 14:05:40 INFO client.TransportClientFactory: Found inactive
> connection to ip-172-31-20-104/172.31.20.104:49991, creating a new one.
> 16/06/17 14:05:40 ERROR shuffle.RetryingBlockFetcher: Exception while
> beginning fetch of 1 outstanding blocks
> java.io.IOException: Failed to connect to
> ip-172-31-20-104/172.31.20.104:49991              <------ could this be because
> Spark is not stable, and can Spark recover by itself from these kinds of
> errors? (I saw some in successful runs)
>
>         at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
>         at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
> ............
> Caused by: java.net.ConnectException: Connection refused:
> ip-172-31-20-104/172.31.20.104:49991
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>         at
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>         at
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>         at
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>         at
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>         at
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>         at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>
>
> 16/06/17 11:54:38 ERROR executor.Executor: Managed memory leak detected;
> size = 16777216 bytes, TID = 100323           <-----       could this be a
> memory leak issue? Though no GC exception was thrown, as there would be for the
> usual kinds of out-of-memory errors
> 16/06/17 11:54:38 ERROR executor.Executor: Exception in task 145.0 in stage
> 112.0 (TID 100323)
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:837)
>         at
> org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:679)
>         at
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
>         at java.io.DataInputStream.readFully(DataInputStream.java:195)
>         at
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
>         at
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)
> ...........
>
> Sorry, there are some messages like these in the middle of the log file, but
> everything is okay at the end of the log.
> In the run log file (log_file) generated by the command:
> nohup spark-submit --driver-memory 20g  --num-executors 20 --class
> com.dianrong.Main  --master yarn-client  dianrong-retention_2.10-1.0.jar
> doAnalysisExtremeLender  /tmp/drretention/test/output  0.96
> /tmp/drretention/evaluation/test_karthik/lgmodel
> /tmp/drretention/input/feature_6.0_20151001_20160531_behavior_201511_201604_summary/lenderId_feature_live
> 50 > log_file
>
> executor 40 lost                        <------    could the job sometimes
> fail because of this?
> ..........
>
>         at
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
>         at java.io.DataInputStream.readFully(DataInputStream.java:195)
>         at
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
>         at
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)
> ..........
>
>
> Thanks in advance!
>
>
>
>
>
> On Friday, June 17, 2016 3:52 PM, Alexander Kapustin <kpavn@hotmail.com>
> wrote:
>
>
> Hi,
>
> Did you submit the Spark job via YARN? In some cases (probably memory
> configuration), YARN can kill the containers where Spark tasks are executed. In this
> situation, please check the YARN userlogs for more information…
>
> --
> WBR, Alexander
>
> From: Zhiliang Zhu
> Sent: June 17, 2016, 9:36
> To: Zhiliang Zhu; User
> Subject: Re: spark job automatically killed without rhyme or reason
>
> Has anyone ever met a similar problem? It is quite strange ...
>
>
> On Friday, June 17, 2016 2:13 PM, Zhiliang Zhu <zchl.jump@yahoo.com.INVALID>
> wrote:
>
>
> Hi All,
>
> I have a big job which usually takes more than one hour to run in full.
> However, it unreasonably exits midway (almost
> 80% of the job has actually finished, but not all of it),
> without any apparent error or exception in the log.
>
> I have submitted the same job many times, and the result is always the same.
> The last line of the run log is just the one word "killed", or sometimes
> there is no error log at all; everything seems okay, but the job should not have finished.
>
> How can this problem be solved? Has anyone else ever met
> a similar issue ...
>
> Thanks in advance!
>
>
>
>
>
>
>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org








        

  