Hi Roland,

I just tried what you suggested, and it actually helped me find the root cause. Once I had the default EMR cluster, I submitted a Spark job from the master instance (using the 'spark-submit' command in a terminal) instead of using Livy to submit the job.
This way I got much more logging in the terminal, and the logging actually showed me what was causing the timeout. The timeout was related to a service call inside our company, and this service call failed due to access constraints.
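For reference, I ran something along these lines directly on the master node (class name and jar path are placeholders here):

{code}
# Submitted directly from the EMR master node instead of through Livy;
# spark-submit prints the YARN application status and errors to the terminal.
# Class name and jar path are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  s3://my-bucket/my-app.jar
{code}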

Fixing those access constraints made the Spark job succeed!

So, in conclusion: nothing related to Spark itself; it was the Livy output logging that was hiding the real error details.

Thank you all for the help! :-)

Jochen

On Fri, Oct 4, 2019 at 19:32, Roland Johann <roland.johann@phenetic.io> wrote:
Hi Jochen,

Can you create a small EMR cluster with all defaults and run the job there? This way we can ensure that the issue is not related to infrastructure or YARN configuration.
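Something along these lines should be enough for a throwaway default cluster (cluster name, key pair, and instance type/count are placeholders):

{code}
# Minimal throwaway EMR cluster with default roles and the
# default EMR-managed security groups (all names are placeholders)
aws emr create-cluster \
  --name "spark-debug" \
  --release-label emr-5.24.0 \
  --applications Name=Spark \
  --use-default-roles \
  --ec2-attributes KeyName=my-key \
  --instance-type m5.xlarge \
  --instance-count 3
{code}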

Kind regards

Jochen Hebbrecht <jochenhebbrecht@gmail.com> wrote on Fri, Oct 4, 2019 at 19:27:
Hi Roland,

I switched to the default security groups and ran my job again, but the same exception pops up :-( ...
All traffic is open on the security groups now.

Jochen

On Fri, Oct 4, 2019 at 17:37, Roland Johann <roland.johann@phenetic.io> wrote:
These are dynamic port ranges and depend on the configuration of your cluster. There is a separate application master per job, so there can't be just one port.
If I remember correctly, the default EMR setup creates worker security groups with unrestricted traffic within the group, e.g. between the worker nodes.
Depending on your security requirements, I suggest that you start with a default-like setup and determine the ports and port ranges from the docs afterwards to further restrict traffic between the nodes.
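For illustration, a self-referencing ingress rule of the kind the EMR-managed groups use would look roughly like this (the group ID is a placeholder):

{code}
# Allow all traffic between instances that share this security group,
# similar to what the default EMR-managed groups do (group ID is a placeholder)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol -1 \
  --source-group sg-0123456789abcdef0
{code}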

Kind regards

Jochen Hebbrecht <jochenhebbrecht@gmail.com> wrote on Fri, Oct 4, 2019 at 17:16:
Hi Roland,

We do indeed have custom security groups. Can you tell me what exactly needs to be able to access what?
For example, is it from the master instance to the driver instance? And which ports should be open?

Jochen

On Fri, Oct 4, 2019 at 17:14, Roland Johann <roland.johann@phenetic.io> wrote:
Hi Jochen,

did you set up the EMR cluster with custom security groups? Can you confirm that the relevant EC2 instances can connect through the relevant ports?

Best regards

Jochen Hebbrecht <jochenhebbrecht@gmail.com> wrote on Fri, Oct 4, 2019 at 17:09:
Hi Jeff,

Thanks! Just tried that, but the same timeout occurs :-( ...

Jochen

On Fri, Oct 4, 2019 at 16:37, Jeff Zhang <zjffdu@gmail.com> wrote:
You can try increasing the property spark.yarn.am.waitTime (by default it is 100s).
Maybe you are doing some very time-consuming operation when initializing the SparkContext, which causes the timeout.
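For example (the 300s value is just an illustration; class name and jar are placeholders):

{code}
# Increase how long the YARN ApplicationMaster waits for the SparkContext
# to be initialized (default is 100s); class name and jar are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.am.waitTime=300s \
  --class com.example.MyApp \
  my-app.jar
{code}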



Jochen Hebbrecht <jochenhebbrecht@gmail.com> wrote on Fri, Oct 4, 2019 at 10:08 PM:

Hi,

I'm using Spark 2.4.2 on AWS EMR 5.24.0. I'm trying to submit a Spark job to the cluster. The job gets accepted, but the YARN application fails with:


{code}
19/09/27 14:33:35 ERROR ApplicationMaster: Uncaught exception:
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
19/09/27 14:33:35 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
{code}

It actually goes wrong at this line: https://github.com/apache/spark/blob/v2.4.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L468
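
For reference, the statement at that line looks roughly like this (paraphrased from the linked source, so I may be reading it wrong):

{code}
// ApplicationMaster.runDriver(), around the linked line: the AM starts the
// user class in a separate thread and then blocks here until that class has
// created a SparkContext, or until the configured wait time elapses.
val sc = ThreadUtils.awaitResult(sparkContextPromise.future,
  Duration(totalWaitTime, TimeUnit.MILLISECONDS))
{code}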

Now, I'm 100% sure Spark is fine and there's no bug, but there must be something wrong with my setup. I don't understand the code of the ApplicationMaster, so could somebody explain to me what it is trying to reach? Where exactly does the connection time out? Then at least I can debug it further, because I don't have a clue what it is doing :-)

Thanks for any help!
Jochen



--
Best Regards

Jeff Zhang
--

Roland Johann

Software Developer/Data Engineer

phenetic GmbH
Lütticher Straße 10, 50674 Köln, Germany

Mobil: +49 172 365 26 46
Mail: roland.johann@phenetic.io
Web: phenetic.io

Handelsregister: Amtsgericht Köln (HRB 92595)
Geschäftsführer: Roland Johann, Uwe Reimann