spark-user mailing list archives

From Jochen Hebbrecht <jochenhebbre...@gmail.com>
Subject Re: Spark job fails because of timeout to Driver
Date Sun, 06 Oct 2019 16:09:05 GMT
Hi Igor,

No, it was not a memory issue, but thanks for your question. It could
indeed have been a resource problem :-)

Jochen

On Fri, 4 Oct 2019 at 19:51, igor cabral uchoa <igoruchoa5e@yahoo.com.br> wrote:

> Maybe it is a basic question, but does your cluster have enough resources
> to run your application? It is requesting 208G of RAM.
>
> Thanks,
>
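
The 208G figure presumably sums the four 52g heaps (three executors plus the driver) from the spark-submit command quoted further down. As a rough sketch, assuming cluster mode (so the driver also runs in a YARN container) and ignoring any rounding YARN applies to container sizes, the request including the configured memoryOverhead values can be tallied like this. The helper name is hypothetical:

```python
# Hypothetical helper; the figures come from the spark-submit command quoted in this thread.
def yarn_request_gib(heap_gib, overhead_mib, instances=1):
    """Memory asked of YARN for `instances` containers: heap plus overhead each."""
    return instances * (heap_gib + overhead_mib / 1024)

# Heap only: three executors plus one driver, 52g each.
heap_total = 3 * 52 + 52

# Heap plus the 6144 MiB memoryOverhead configured for executors and driver.
executors = yarn_request_gib(52, 6144, instances=3)
driver = yarn_request_gib(52, 6144)
total = executors + driver

print(f"heap only: {heap_total} GiB, with overhead: {total} GiB")
```

Under these assumptions the cluster needs room for roughly 232 GiB of containers, not just 208 GiB of heap.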
> On Friday, October 4, 2019, 2:31 PM, Jochen Hebbrecht <
> jochenhebbrecht@gmail.com> wrote:
>
> Hi Igor,
>
> We are deploying by submitting a batch job on a Livy server (from our
> local PC or a Jenkins node). The Livy server then deploys the Spark job on
> the cluster itself.
>
> For example:
> ---
>
> Running '/usr/lib/spark/bin/spark-submit' '--class' '##MY_MAIN_CLASS##' '--conf' 'spark.driver.userClassPathFirst=true'
> '--conf' 'spark.default.parallelism=180' '--conf' 'spark.executor.memory=52g' '--conf' 'spark.driver.memory=52g'
> '--conf' 'spark.yarn.tags=livy-batch-0-owjPBdmC' '--conf' 'spark.executor.instances=3' '--conf'
> 'spark.executor.memoryOverhead=6144' '--conf' 'spark.driver.cores=6' '--conf' 'spark.driver.memoryOverhead=6144'
> '--conf' 'spark.executor.extraJavaOptions=-XX:ThreadStackSize=2048 -XX:+UseConcMarkSweepGC
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled
> -XX:OnOutOfMemoryError=\'kill -9 %p\'' '--conf' 'spark.executor.userClassPathFirst=true' '--conf'
> 'spark.submit.deployMode=cluster' '--conf' 'spark.yarn.submit.waitAppCompletion=false' '--conf'
> 'spark.executor.extraClassPath=true' '-- ...
>
> ---
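
For readers unfamiliar with Livy, a batch like the one above is typically created by POSTing a JSON body to the server's `/batches` endpoint. The sketch below only builds such a request and does not send it. The Livy URL, jar path, and class name are hypothetical placeholders, and only a few of the conf values from the command above are carried over:

```python
import json
import urllib.request

def build_livy_batch_request(livy_url, jar_path, main_class, conf):
    """Build (but do not send) a POST /batches request for a Livy server."""
    payload = {"file": jar_path, "className": main_class, "conf": conf}
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_livy_batch_request(
    "http://livy.example.internal:8998",    # hypothetical Livy endpoint
    "s3://example-bucket/jobs/my-job.jar",  # hypothetical jar location
    "com.example.MyMainClass",              # placeholder for ##MY_MAIN_CLASS##
    {
        "spark.executor.memory": "52g",
        "spark.driver.memory": "52g",
        "spark.executor.instances": "3",
    },
)
# urllib.request.urlopen(request) would actually submit the batch.
print(request.get_method(), request.full_url)
```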
>
> Jochen
>
> On Fri, 4 Oct 2019 at 17:42, igor cabral uchoa <igoruchoa5e@yahoo.com.br> wrote:
>
> Hi Roland!
>
> What deploy mode are you using when you submit your applications? Is it
> client or cluster mode?
>
> Regards,
>
> On Friday, October 4, 2019, 12:37 PM, Roland Johann
> <roland.johann@phenetic.io.INVALID> wrote:
>
> These are dynamic port ranges and depend on the configuration of your
> cluster. There is a separate application master per job, so there can't be
> just one port.
> If I remember correctly, the default EMR setup creates worker security
> groups with unrestricted traffic within the group, i.e. between the worker
> nodes.
> Depending on your security requirements, I suggest that you start with a
> default-like setup and determine the ports and port ranges from the docs
> afterwards to further restrict traffic between the nodes.
>
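
One way to test whether a security group blocks traffic between two nodes is a plain TCP connect from one instance to another. The sketch below is a generic check, not an EMR or Spark tool. Because the ports are dynamic by default, a real check would use whatever port the driver or application master reports in its logs; the host name in the closing comment is hypothetical:

```python
import socket

def can_connect(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False

# Self-check against a throwaway local listener so the sketch runs anywhere.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)
    local_port = listener.getsockname()[1]
    reachable = can_connect("127.0.0.1", local_port)

print("local listener reachable:", reachable)
# On a worker node one would instead call, for example:
#   can_connect("ip-10-0-0-12.example.internal", 40123)  # hypothetical host and port
```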
> Kind regards
>
> On Fri, 4 Oct 2019 at 17:16, Jochen Hebbrecht <jochenhebbrecht@gmail.com> wrote:
>
> Hi Roland,
>
> We do indeed have custom security groups. Can you tell me exactly which
> instance needs to be able to reach which?
> For example, is it from the master instance to the driver instance? And
> which port should be open?
>
> Jochen
>
> On Fri, 4 Oct 2019 at 17:14, Roland Johann <roland.johann@phenetic.io> wrote:
>
> Hi Jochen,
>
> Did you set up the EMR cluster with custom security groups? Can you confirm
> that the relevant EC2 instances can connect through the relevant ports?
>
> Best regards
>
> On Fri, 4 Oct 2019 at 17:09, Jochen Hebbrecht <jochenhebbrecht@gmail.com> wrote:
>
> Hi Jeff,
>
> Thanks! Just tried that, but the same timeout occurs :-( ...
>
> Jochen
>
> On Fri, 4 Oct 2019 at 16:37, Jeff Zhang <zjffdu@gmail.com> wrote:
>
> You can try increasing the property spark.yarn.am.waitTime (the default is
> 100s).
> Maybe you are doing some very time-consuming operation when initializing
> the SparkContext, which causes the timeout.
>
> See this property here
> http://spark.apache.org/docs/latest/running-on-yarn.html
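
The 100s default matches the `100000 milliseconds` in the log quoted further down. As a small sketch of the suggestion, the property can be passed like any other `--conf` setting. The helper below is hypothetical and only renders the arguments, and the `300s` value is an arbitrary example, not a recommendation:

```python
# Hypothetical helper that renders a conf dict as spark-submit arguments.
def to_conf_args(conf):
    """Turn {"key": "value"} pairs into ['--conf', 'key=value', ...]."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args

DEFAULT_AM_WAIT_MS = 100 * 1000  # the 100s default, as reported in the log

conf = {
    "spark.submit.deployMode": "cluster",
    "spark.yarn.am.waitTime": "300s",  # arbitrary example value
}
args = to_conf_args(conf)
print(" ".join(args))
```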
>
>
> On Fri, 4 Oct 2019 at 22:08, Jochen Hebbrecht <jochenhebbrecht@gmail.com> wrote:
>
> Hi,
>
> I'm using Spark 2.4.2 on AWS EMR 5.24.0. I'm trying to submit a Spark job
> to the cluster. The job gets accepted, but the YARN application fails
> with:
>
>
> {code}
> 19/09/27 14:33:35 ERROR ApplicationMaster: Uncaught exception:
> java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
>     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>     at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> 19/09/27 14:33:35 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Uncaught exception:
> java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
>     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>     at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>     at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> {code}
>
> It actually goes wrong at this line:
> https://github.com/apache/spark/blob/v2.4.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L468
>
> Now, I'm 100% sure Spark is OK and there's no bug, but there must be
> something wrong with my setup. I don't understand the code of the
> ApplicationMaster, so could somebody explain to me what it is trying to
> reach? Where exactly does the connection time out? Then at least I can
> debug it further, because I don't have a clue what it is doing :-)
>
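
As Jeff's reply in this thread suggests, the wait at that line appears to be for the SparkContext to be initialized by the user's main class, bounded by spark.yarn.am.waitTime, and not for a single network connection. The following is only a Python analogy of that pattern (start user code in a thread, wait on a future with a timeout). It is not Spark's code, and every name in it is made up for illustration:

```python
import threading
from concurrent.futures import Future, TimeoutError as FutureTimeout

def run_driver(user_main, wait_time_ms):
    """Start user_main in a thread and wait for it to publish its context."""
    context_ready = Future()
    worker = threading.Thread(target=user_main, args=(context_ready,), daemon=True)
    worker.start()
    try:
        return context_ready.result(timeout=wait_time_ms / 1000)
    except FutureTimeout:
        return f"FAILED: timed out after {wait_time_ms} ms"

def fast_main(context_ready):
    # The user code creates its context promptly.
    context_ready.set_result("context initialized")

def stuck_main(context_ready):
    # Stands in for slow startup work; the context is never published in time.
    threading.Event().wait(1.0)

ok = run_driver(fast_main, wait_time_ms=500)
failed = run_driver(stuck_main, wait_time_ms=100)
print(ok)
print(failed)
```

In this analogy, anything that delays or prevents the user code from creating its context (slow initialization, or a failure before the context exists) surfaces as the same timeout.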
> Thanks for any help!
> Jochen
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
> --
>
>
> *Roland Johann*
> Software Developer/Data Engineer
>
> *phenetic GmbH*
> Lütticher Straße 10, 50674 Köln, Germany
>
> Mobile: +49 172 365 26 46
> Mail: roland.johann@phenetic.io
> Web: phenetic.io
>
> Commercial register: Amtsgericht Köln (HRB 92595)
> Managing directors: Roland Johann, Uwe Reimann
