libcloud-notifications mailing list archives

From: "zahlenofzahlen (JIRA)" <j...@apache.org>
Subject: [jira] [Commented] (LIBCLOUD-532) deploy_node(..) occasionally fails on EC2
Date: Fri, 28 Mar 2014 17:45:42 GMT

    [ https://issues.apache.org/jira/browse/LIBCLOUD-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951120#comment-13951120
] 

zahlenofzahlen commented on LIBCLOUD-532:
-----------------------------------------

For those hitting the EOFError or similar race conditions with the provider's SSH key
injection mechanism, I also suggest modifying base.py so that deploy_node accepts a wait
period as a keyword argument, defaulting to 1.5 seconds, e.g. deploy_node(user_wait_period=1.5, ...),
and sleeps for that long before each login attempt:

        user_wait_period = kwargs.get('user_wait_period', 1.5)

        for username in ([ssh_username] + ssh_alternate_usernames):
            try:
                time.sleep(user_wait_period)

Otherwise the workaround fires all 10 retries within a couple of seconds and will likely
still fail. API response times differ per provider and, in most cases, also vary with load,
so I think we would want retry logic around the connect, the sftp put, etc., each with a
central, per-provider tunable retry count and a wait_period timer that grows exponentially
up to a known maximum value.
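
A minimal sketch of what such a retry wrapper might look like. This is not libcloud code:
retry_with_backoff and all of its parameters are hypothetical names chosen for illustration,
and the defaults (10 retries, 1.5 second initial wait) simply mirror the numbers above.

```python
import time


def retry_with_backoff(func, retries=10, initial_wait=1.5, max_wait=30.0,
                       backoff=2.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or the retry budget is used up.

    Hypothetical helper, not part of libcloud. The wait between attempts
    starts at initial_wait and is multiplied by backoff after every
    failure, but never exceeds max_wait.
    """
    wait = initial_wait
    for attempt in range(1, retries + 1):
        try:
            return func()
        except exceptions:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == retries:
                raise
            sleep(wait)
            # Grow the delay exponentially, capped at the known maximum.
            wait = min(wait * backoff, max_wait)
```

Each step (connect, sftp put, and so on) could be passed in as func, with the retry count
and wait limits taken from a per-provider setting. The sleep parameter is only there so the
delay can be replaced when testing.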

> deploy_node(..) occasionally fails on EC2
> -----------------------------------------
>
>                 Key: LIBCLOUD-532
>                 URL: https://issues.apache.org/jira/browse/LIBCLOUD-532
>             Project: Libcloud
>          Issue Type: Bug
>          Components: Compute
>         Environment: apache-libcloud 0.14.1, Windows 7
>            Reporter: Stefan Müller
>
> h2. Observed behaviour:
> When I'm starting EC2 nodes with {{deploy_node(ssh_key=...)}} I occasionally (about 50%
of the time) get an error message indicating that my key is not a valid DSA key.
> This seems a bit odd, since I'm using an RSA key. 
> h2. Cause
> Turns out the cause is somewhere else:
> When starting a node, there is a short time during which the SSH daemon is already up
and running, but the public-key has not yet been put into the `authorized_keys` file. Apparently
the SSH daemon is started before Amazon's key-injection magic has finished.
> During this short time (I'd guess about a second) SSH is rejecting the private key, with
an authentication error.
> libcloud then tries some other means of authentication during which it apparently tries
to parse the key as a DSA key, causing the reported error.
> Note that the extra-long timeout used for the SSH connection attempt is not helping in
this case, since the SSH server is replying already.
> h2. Suggested Fix
> I suggest reacting to a failed authentication with a few retries, with a delay of a second
or two between them, similar to {{wait_until_running()}}.
> h2. Workaround
> {code}
> deploy_node(...,ssh_alternate_usernames=["root" for _ in range(10)])
> {code}
> This causes libcloud to make several authentication attempts. It is slow enough to
delay until the public-key is in place. Solves the problem reliably, but not elegantly :)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
