spark-user mailing list archives

From Jochen Hebbrecht <jochenhebbre...@gmail.com>
Subject Re: Spark job fails because of timeout to Driver
Date Fri, 04 Oct 2019 17:27:07 GMT
Hi Roland,

I switched to the default security groups and ran my job again, but the same
exception pops up :-( ...
All traffic is open on the security groups now.
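
To rule out networking at this point, a quick sanity check (not from the thread, just a sketch with placeholder values) is to test connectivity from the master node to a worker node directly. The IP below is a stand-in for a core node's private address, and 8030 is the default YARN ResourceManager scheduler port the application master talks to:

```shell
# Placeholder check: 10.0.1.23 stands in for a core/worker node's private IP.
# -z: scan without sending data, -v: verbose result.
nc -zv 10.0.1.23 8030

# Also confirm the NodeManagers are registered with YARN:
yarn node -list
```

If these succeed, the timeout is more likely happening inside the application itself than at the network layer.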

Jochen

On Fri, 4 Oct 2019 at 17:37, Roland Johann <roland.johann@phenetic.io> wrote:

> These are dynamic port ranges and depend on the configuration of your
> cluster. Per job there is a separate application master, so there can't be
> just one port.
> If I remember correctly, the default EMR setup creates worker security
> groups with unrestricted traffic within the group, e.g. between the worker
> nodes.
> Depending on your security requirements, I suggest that you start with a
> default-like setup and determine the ports and port ranges from the docs
> afterwards to further restrict traffic between the nodes.
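
The "unrestricted traffic within the group" setup described above can be expressed as a self-referencing security group rule. This is only a sketch: the group ID is a placeholder, and EMR's managed groups (ElasticMapReduce-master/slave) already contain equivalent rules.

```shell
# Sketch: allow all protocols and ports between instances that share
# the same security group (sg-0123456789abcdef0 is a placeholder).
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol -1 \
  --source-group sg-0123456789abcdef0
```

A self-referencing rule like this keeps intra-cluster traffic open while still blocking everything from outside the group.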
>
> Kind regards
>
> On Fri, 4 Oct 2019 at 17:16, Jochen Hebbrecht <jochenhebbrecht@gmail.com>
> wrote:
>
>> Hi Roland,
>>
>> We have indeed custom security groups. Can you tell me where exactly I
>> need to be able to access what?
>> For example, is it from the master instance to the driver instance? And
>> which port should be open?
>>
>> Jochen
>>
>> On Fri, 4 Oct 2019 at 17:14, Roland Johann <roland.johann@phenetic.io>
>> wrote:
>>
>>> Hi Jochen,
>>>
>>> Did you set up the EMR cluster with custom security groups? Can you
>>> confirm that the relevant EC2 instances can connect through the relevant
>>> ports?
>>>
>>> Best regards
>>>
>>> On Fri, 4 Oct 2019 at 17:09, Jochen Hebbrecht <jochenhebbrecht@gmail.com>
>>> wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>> Thanks! Just tried that, but the same timeout occurs :-( ...
>>>>
>>>> Jochen
>>>>
>>>> On Fri, 4 Oct 2019 at 16:37, Jeff Zhang <zjffdu@gmail.com> wrote:
>>>>
>>>>> You can try to increase the property spark.yarn.am.waitTime (by default
>>>>> it is 100s).
>>>>> Maybe you are doing some very time-consuming operation when
>>>>> initializing the SparkContext, which causes the timeout.
>>>>>
>>>>> See this property here
>>>>> http://spark.apache.org/docs/latest/running-on-yarn.html
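
For example, the suggested property can be raised at submit time (a sketch: the jar and class names are placeholders, and spark.yarn.am.waitTime only applies in YARN cluster mode):

```shell
# Sketch: raise the application master's wait time for SparkContext
# initialization from the default 100s to 300s.
# com.example.MyJob and my-job.jar are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.am.waitTime=300s \
  --class com.example.MyJob \
  my-job.jar
```

If the job still times out with a much larger value, the SparkContext is probably never being created at all rather than just being slow.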
>>>>>
>>>>>
>>>>> On Fri, 4 Oct 2019 at 22:08, Jochen Hebbrecht
>>>>> <jochenhebbrecht@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm using Spark 2.4.2 on AWS EMR 5.24.0. I'm trying to send a Spark
>>>>>> job to the cluster. The job gets accepted, but the YARN application
>>>>>> fails with:
>>>>>>
>>>>>>
>>>>>> {code}
>>>>>> 19/09/27 14:33:35 ERROR ApplicationMaster: Uncaught exception:
>>>>>> java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
>>>>>> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>>>>>> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>>>>>> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
>>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
>>>>>> 19/09/27 14:33:35 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Uncaught exception:
>>>>>> java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
>>>>>> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>>>>>> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>>>>>> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
>>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
>>>>>> {code}
>>>>>>
>>>>>> It actually goes wrong at this line:
>>>>>> https://github.com/apache/spark/blob/v2.4.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L468
>>>>>>
>>>>>> Now, I'm 100% sure Spark is OK and there's no bug, but there must be
>>>>>> something wrong with my setup. I don't understand the code of the
>>>>>> ApplicationMaster, so could somebody explain to me what it is trying
>>>>>> to reach? Where exactly does the connection time out? Then at least I
>>>>>> can debug it further, because I don't have a clue what it is doing :-)
>>>>>>
>>>>>> Thanks for any help!
>>>>>> Jochen
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best Regards
>>>>>
>>>>> Jeff Zhang
>>>>>
>>>> --
>>>
>>>
>>> *Roland Johann*
>>> Software Developer/Data Engineer
>>>
>>> *phenetic GmbH*
>>> Lütticher Straße 10, 50674 Köln, Germany
>>>
>>> Mobil: +49 172 365 26 46
>>> Mail: roland.johann@phenetic.io
>>> Web: phenetic.io
>>>
>>> Handelsregister: Amtsgericht Köln (HRB 92595)
>>> Geschäftsführer: Roland Johann, Uwe Reimann
>>>
>> --
>
>
> *Roland Johann*
> Software Developer/Data Engineer
>
> *phenetic GmbH*
> Lütticher Straße 10, 50674 Köln, Germany
>
> Mobil: +49 172 365 26 46
> Mail: roland.johann@phenetic.io
> Web: phenetic.io
>
> Handelsregister: Amtsgericht Köln (HRB 92595)
> Geschäftsführer: Roland Johann, Uwe Reimann
>
