spark-user mailing list archives

From Jochen Hebbrecht <jochenhebbre...@gmail.com>
Subject Re: Spark job fails because of timeout to Driver
Date Sun, 06 Oct 2019 16:08:24 GMT
Hi Roland,

I just tried what you suggested, and it actually helped me find the root
cause. Once I had the default EMR cluster, I submitted a Spark job from the
master instance (using the 'spark-submit' command in a terminal) instead of
submitting it through Livy.
That way I got much more logging in the terminal, and the logging showed what
was causing the timeout. The timeout was related to a service call inside our
company, and that call failed because of access constraints.
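
For reference, this is roughly how I submitted it directly on the master node
(just a sketch; the class and jar below are placeholders, not our actual job):

{code}
# Run directly on the EMR master node, bypassing Livy.
# The --class and jar path are placeholders.
spark-submit \
  --master yarn \
  --class com.example.MyJob \
  /home/hadoop/my-job.jar
{code}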

Fixing those access constraints made the Spark job succeed!

So, in conclusion: nothing related to Spark itself; it was the Livy output
logging that was hiding the real error details.

Thank you all for your help! :-)

Jochen

On Fri, 4 Oct 2019 at 19:32, Roland Johann <roland.johann@phenetic.io> wrote:

> Hi Jochen,
>
> Can you create a small EMR cluster with all defaults and run the job there?
> This way we can ensure that the issue is not related to the infrastructure
> or the YARN configuration.
>
> Kind regards
>
> Jochen Hebbrecht <jochenhebbrecht@gmail.com> wrote on Fri, 4 Oct 2019 at
> 19:27:
>
>> Hi Roland,
>>
>> I switched to the default security groups and ran my job again, but the
>> same exception pops up :-( ...
>> All traffic is now open on the security groups.
>>
>> Jochen
>>
>> On Fri, 4 Oct 2019 at 17:37, Roland Johann <roland.johann@phenetic.io>
>> wrote:
>>
>>> These are dynamic port ranges and depend on the configuration of your
>>> cluster. There is a separate application master per job, so there can't be
>>> just one port.
>>> If I remember correctly, the default EMR setup creates worker security
>>> groups with unrestricted traffic within the group, e.g. between the worker
>>> nodes.
>>> Depending on your security requirements, I suggest that you start with a
>>> default-like setup and determine the ports and port ranges from the docs
>>> afterwards to further restrict traffic between the nodes.
>>>
>>> Kind regards
>>>
>>> Jochen Hebbrecht <jochenhebbrecht@gmail.com> wrote on Fri, 4 Oct 2019
>>> at 17:16:
>>>
>>>> Hi Roland,
>>>>
>>>> We do indeed have custom security groups. Can you tell me exactly what
>>>> needs to be able to access what?
>>>> For example, is it from the master instance to the driver instance? And
>>>> which ports should be open?
>>>>
>>>> Jochen
>>>>
>>>> On Fri, 4 Oct 2019 at 17:14, Roland Johann <roland.johann@phenetic.io>
>>>> wrote:
>>>>
>>>>> Hi Jochen,
>>>>>
>>>>> Did you set up the EMR cluster with custom security groups? Can you
>>>>> confirm that the relevant EC2 instances can connect through the relevant
>>>>> ports?
>>>>>
>>>>> Best regards
>>>>>
>>>>> Jochen Hebbrecht <jochenhebbrecht@gmail.com> wrote on Fri, 4 Oct 2019
>>>>> at 17:09:
>>>>>
>>>>>> Hi Jeff,
>>>>>>
>>>>>> Thanks! Just tried that, but the same timeout occurs :-( ...
>>>>>>
>>>>>> Jochen
>>>>>>
>>>>>> On Fri, 4 Oct 2019 at 16:37, Jeff Zhang <zjffdu@gmail.com> wrote:
>>>>>>
>>>>>>> You can try to increase the property spark.yarn.am.waitTime (by
>>>>>>> default it is 100s).
>>>>>>> Maybe you are doing some very time-consuming operation when
>>>>>>> initializing the SparkContext, which causes the timeout.
>>>>>>>
>>>>>>> See this property here
>>>>>>> http://spark.apache.org/docs/latest/running-on-yarn.html
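>>>>>>>
>>>>>>> For example, you could pass it on submit like this (just a sketch; the
>>>>>>> class and jar are placeholders, and you should adjust the value to
>>>>>>> your needs):
>>>>>>>
>>>>>>> {code}
>>>>>>> # Give the YARN ApplicationMaster more time to wait for the
>>>>>>> # SparkContext to be initialized (the default is 100s).
>>>>>>> spark-submit \
>>>>>>>   --master yarn \
>>>>>>>   --deploy-mode cluster \
>>>>>>>   --conf spark.yarn.am.waitTime=300s \
>>>>>>>   --class com.example.MyJob \
>>>>>>>   /home/hadoop/my-job.jar
>>>>>>> {code}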
>>>>>>>
>>>>>>>
>>>>>>> Jochen Hebbrecht <jochenhebbrecht@gmail.com> wrote on Fri, 4 Oct
>>>>>>> 2019 at 10:08 PM:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm using Spark 2.4.2 on AWS EMR 5.24.0. I'm trying to submit a Spark
>>>>>>>> job to the cluster. The job gets accepted, but the YARN application
>>>>>>>> fails with:
>>>>>>>>
>>>>>>>>
>>>>>>>> {code}
>>>>>>>> 19/09/27 14:33:35 ERROR ApplicationMaster: Uncaught exception:
>>>>>>>> java.util.concurrent.TimeoutException: Futures timed out after
>>>>>>>> [100000 milliseconds]
>>>>>>>> at
>>>>>>>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>>>>>>>> at
>>>>>>>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>>>>>>>> at
>>>>>>>> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>>>>>>>> at
>>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
>>>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.org
>>>>>>>> $apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
>>>>>>>> at
>>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
>>>>>>>> at
>>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>>>>> at
>>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>>>>> at
>>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
>>>>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>>>>>>>> at
>>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
>>>>>>>> at
>>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
>>>>>>>> at
>>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>>>>>>>> at
>>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
>>>>>>>> 19/09/27 14:33:35 INFO ApplicationMaster: Final app status: FAILED,
>>>>>>>> exitCode: 13, (reason: Uncaught exception:
>>>>>>>> java.util.concurrent.TimeoutException: Futures timed out after
>>>>>>>> [100000 milliseconds]
>>>>>>>> at
>>>>>>>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>>>>>>>> at
>>>>>>>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>>>>>>>> at
>>>>>>>> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>>>>>>>> at
>>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
>>>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.org
>>>>>>>> $apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
>>>>>>>> at
>>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
>>>>>>>> at
>>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>>>>> at
>>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>>>>>>>> at
>>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
>>>>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>>>>>>>> at
>>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
>>>>>>>> at
>>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
>>>>>>>> at
>>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>>>>>>>> at
>>>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
>>>>>>>> {code}
>>>>>>>>
>>>>>>>> It actually goes wrong at this line:
>>>>>>>> https://github.com/apache/spark/blob/v2.4.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L468
>>>>>>>>
>>>>>>>> Now, I'm 100% sure Spark is OK and there's no bug, but there must
>>>>>>>> be something wrong with my setup. I don't understand the code of the
>>>>>>>> ApplicationMaster, so could somebody explain to me what it is trying
>>>>>>>> to reach? Where exactly does the connection time out? That way I can
>>>>>>>> at least debug it further, because I don't have a clue what it is
>>>>>>>> doing :-)
>>>>>>>>
>>>>>>>> Thanks for any help!
>>>>>>>> Jochen
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards
>>>>>>>
>>>>>>> Jeff Zhang
>>>>>>>
>>>>>> --
>>>>>
>>>>>
>>>>> *Roland Johann*
>>>>> Software Developer/Data Engineer
>>>>>
>>>>> *phenetic GmbH*
>>>>> Lütticher Straße 10, 50674 Köln, Germany
>>>>>
>>>>> Mobil: +49 172 365 26 46
>>>>> Mail: roland.johann@phenetic.io
>>>>> Web: phenetic.io
>>>>>
>>>>> Handelsregister: Amtsgericht Köln (HRB 92595)
>>>>> Geschäftsführer: Roland Johann, Uwe Reimann
>>>>>
>>>> --
>>>
>>>
>>> *Roland Johann*
>>> Software Developer/Data Engineer
>>>
>>> *phenetic GmbH*
>>> Lütticher Straße 10, 50674 Köln, Germany
>>>
>>> Mobil: +49 172 365 26 46
>>> Mail: roland.johann@phenetic.io
>>> Web: phenetic.io
>>>
>>> Handelsregister: Amtsgericht Köln (HRB 92595)
>>> Geschäftsführer: Roland Johann, Uwe Reimann
>>>
>> --
>
>
> *Roland Johann*
> Software Developer/Data Engineer
>
> *phenetic GmbH*
> Lütticher Straße 10, 50674 Köln, Germany
>
> Mobil: +49 172 365 26 46
> Mail: roland.johann@phenetic.io
> Web: phenetic.io
>
> Handelsregister: Amtsgericht Köln (HRB 92595)
> Geschäftsführer: Roland Johann, Uwe Reimann
>
