spark-user mailing list archives

From Roland Johann <roland.joh...@phenetic.io.INVALID>
Subject Re: Spark job fails because of timeout to Driver
Date Fri, 04 Oct 2019 17:32:10 GMT
Hi Jochen,

Can you create a small EMR cluster with all defaults and run the job there?
This way we can ensure that the issue is not infrastructure and YARN
configuration related.
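
For reference, a throwaway default cluster can be spun up with the AWS CLI along these lines (instance type, count, and key name are placeholders you would adjust):

```shell
# Hypothetical sketch: create a small EMR 5.24.0 cluster with default
# roles and default-managed security groups, Spark only, to rule out
# custom infrastructure/YARN configuration as the cause.
aws emr create-cluster \
  --name "spark-timeout-repro" \
  --release-label emr-5.24.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key   # placeholder key pair name
```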

Kind regards

Jochen Hebbrecht <jochenhebbrecht@gmail.com> wrote on Fri, Oct 4, 2019 at 19:27:

> Hi Roland,
>
> I switched to the default security groups, ran my job again but the same
> exception pops up :-( ...
> All traffic is open on the security groups now.
>
> Jochen
>
> On Fri, Oct 4, 2019 at 17:37, Roland Johann <roland.johann@phenetic.io> wrote:
>
>> These are dynamic port ranges and depend on the configuration of your
>> cluster. There is a separate application master per job, so there can't be
>> just one port.
>> If I remember correctly, the default EMR setup creates worker security
>> groups with unrestricted traffic within the group, e.g. between the worker
>> nodes.
>> Depending on your security requirements, I suggest you start with a
>> default-like setup and determine the ports and port ranges from the docs
>> afterwards to further restrict traffic between the nodes.
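
If intra-cluster traffic does turn out to be blocked, one way to mirror EMR's default behavior is a self-referencing security group rule; the group ID below is a placeholder:

```shell
# Hypothetical sketch: allow all traffic (protocol -1 = all) between
# instances that share the same security group, which is what EMR's
# managed worker groups do by default.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol -1 \
  --source-group sg-0123456789abcdef0
```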
>>
>> Kind regards
>>
>> Jochen Hebbrecht <jochenhebbrecht@gmail.com> wrote on Fri, Oct 4, 2019 at 17:16:
>>
>>> Hi Roland,
>>>
>>> We do indeed have custom security groups. Can you tell me exactly what
>>> needs to be able to access what?
>>> For example, is it from the master instance to the driver instance? And
>>> which port should be open?
>>>
>>> Jochen
>>>
>>> On Fri, Oct 4, 2019 at 17:14, Roland Johann <roland.johann@phenetic.io> wrote:
>>>
>>>> Hi Jochen,
>>>>
>>>> did you set up the EMR cluster with custom security groups? Can you
>>>> confirm that the relevant EC2 instances can connect on the relevant ports?
>>>>
>>>> Best regards
>>>>
>>>> Jochen Hebbrecht <jochenhebbrecht@gmail.com> wrote on Fri, Oct 4, 2019 at 17:09:
>>>>
>>>>> Hi Jeff,
>>>>>
>>>>> Thanks! Just tried that, but the same timeout occurs :-( ...
>>>>>
>>>>> Jochen
>>>>>
>>>>> On Fri, Oct 4, 2019 at 16:37, Jeff Zhang <zjffdu@gmail.com> wrote:
>>>>>
>>>>>> You can try increasing the property spark.yarn.am.waitTime (by default
>>>>>> it is 100s).
>>>>>> Maybe you are doing some very time-consuming operation when
>>>>>> initializing the SparkContext, which causes the timeout.
>>>>>>
>>>>>> See this property here
>>>>>> http://spark.apache.org/docs/latest/running-on-yarn.html
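
As an illustration, the timeout could be raised at submission time; the class name and jar path below are placeholders:

```shell
# Hypothetical sketch: in YARN cluster mode, raise the ApplicationMaster's
# wait for the SparkContext from the 100s default to 300s.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.am.waitTime=300s \
  --class com.example.MyJob \
  my-job.jar
```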
>>>>>>
>>>>>>
>>>>>> Jochen Hebbrecht <jochenhebbrecht@gmail.com> wrote on Fri, Oct 4, 2019 at 10:08 PM:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm using Spark 2.4.2 on AWS EMR 5.24.0. I'm trying to submit a Spark
>>>>>>> job to the cluster. The job gets accepted, but the YARN application
>>>>>>> fails with:
>>>>>>>
>>>>>>>
>>>>>>> {code}
>>>>>>> 19/09/27 14:33:35 ERROR ApplicationMaster: Uncaught exception:
>>>>>>> java.util.concurrent.TimeoutException: Futures timed out after
>>>>>>> [100000 milliseconds]
>>>>>>> at
>>>>>>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>>>>>>> at
>>>>>>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>>>>>>> at
>>>>>>> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>>>>>>> at
>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
>>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.org
>>>>>>> $apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
>>>>>>> at
>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
>>>>>>> at
>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>>>> at
>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>>>> at
>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
>>>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>>> at
>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>>>>>>> at
>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
>>>>>>> at
>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
>>>>>>> at
>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>>>>>>> at
>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
>>>>>>> 19/09/27 14:33:35 INFO ApplicationMaster: Final app status: FAILED,
>>>>>>> exitCode: 13, (reason: Uncaught exception:
>>>>>>> java.util.concurrent.TimeoutException: Futures timed out after [100000
>>>>>>> milliseconds]
>>>>>>> at
>>>>>>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>>>>>>> at
>>>>>>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>>>>>>> at
>>>>>>> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>>>>>>> at
>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
>>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.org
>>>>>>> $apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
>>>>>>> at
>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
>>>>>>> at
>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>>>> at
>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>>>> at
>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
>>>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>>> at
>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>>>>>>> at
>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
>>>>>>> at
>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
>>>>>>> at
>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>>>>>>> at
>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
>>>>>>> {code}
>>>>>>>
>>>>>>> It actually goes wrong at this line:
>>>>>>> https://github.com/apache/spark/blob/v2.4.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L468
>>>>>>>
>>>>>>> Now, I'm 100% sure Spark is OK and there's no bug, but there must be
>>>>>>> something wrong with my setup. I don't understand the code of the
>>>>>>> ApplicationMaster, so could somebody explain to me what it is trying
>>>>>>> to reach? Where exactly does the connection time out? Then at least I
>>>>>>> can debug it further, because I don't have a clue what it is doing :-)
>>>>>>>
>>>>>>> Thanks for any help!
>>>>>>> Jochen
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards
>>>>>>
>>>>>> Jeff Zhang
>>>>>>
>>>>> --
>>>>
>>>>
>>>> *Roland Johann*Software Developer/Data Engineer
>>>>
>>>> *phenetic GmbH*
>>>> Lütticher Straße 10, 50674 Köln, Germany
>>>>
>>>> Mobil: +49 172 365 26 46
>>>> Mail: roland.johann@phenetic.io
>>>> Web: phenetic.io
>>>>
>>>> Handelsregister: Amtsgericht Köln (HRB 92595)
>>>> Geschäftsführer: Roland Johann, Uwe Reimann
>>>>
