whirr-user mailing list archives

From Paul Baclace <paul.bacl...@gmail.com>
Subject possible false-positive DOA on EC2 us-west-1 and WHIRR-378
Date Tue, 20 Sep 2011 00:04:22 GMT
[While describing the problem below I concluded it is a bug, because it 
results in being charged for apparently dead-on-arrival nodes; it looks 
like WHIRR-378, and if it is the same problem, then WHIRR-378 needs a 
workaround.  The following was added as a comment to WHIRR-378.]


I frequently see net.schmizz.sshj.userauth.UserAuthException when using 
Whirr 0.6.0 on EC2; there are many causes for an SSH error, and I dug 
into one run and found that 2 out of 5 nodes were seemingly dead on 
arrival.  That is a high rate, so I wonder whether these were 
false-positive DOAs (in another case I saw 2 out of 7).  The trimmed 
whirr.log below shows (nodes named by the last 3 digits of the i-number):

1. starting 3 instances/nodes (fbe, fc0, fc2) at 3:37:19
2. problem with a node (fc2) at 3:38:46 or 87 sec. after node start
3. starting a new instance/node (01c) at 3:40:14
4. problem with another node (01c) at 3:41:19, or 65 sec. after node start
5. starting a new instance/node (040) at 3:41:22
6. deleting failed nodes (01c, fc2) at 3:44:34
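The two failure intervals above can be checked directly against the log 
timestamps (a quick sanity-check script; timestamps copied from the 
excerpt below):

```python
from datetime import datetime

def elapsed_seconds(start, end):
    """Seconds between two whirr.log HH:MM:SS timestamps (same day assumed)."""
    fmt = "%H:%M:%S"
    return int((datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds())

# fc2: started 03:37:19, "problem applying options" at 03:38:46
print(elapsed_seconds("03:37:19", "03:38:46"))  # 87
# 01c: started 03:40:14, "problem applying options" at 03:41:19
print(elapsed_seconds("03:40:14", "03:41:19"))  # 65
```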

The most frequent underlying ("Caused by") SSH error is 
"net.schmizz.sshj.userauth.UserAuthException: publickey auth failed".  
It looks like the overall error "problem applying options to node" 
occurs 10 seconds after the socket is opened, so the node is alive to 
some extent and this does not appear to be an SSH timeout.  That it 
happens about 1 minute after instance start makes me think there could 
be an implicit timer awaiting boot-up.  (These instances all use the 
same private AMI from instance-store, with no EBS volumes.)
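If there is such an implicit boot-up deadline, then retrying for longer 
before declaring a node dead should separate slow boots from true DOAs. 
A minimal sketch of that idea (the `probe` callable and the timings are 
hypothetical illustrations, not Whirr's actual retry logic):

```python
import time

def wait_until_ready(probe, deadline_secs=300, interval_secs=10):
    """Poll `probe` (e.g. one SSH publickey-auth attempt, returning True
    on success) until it succeeds or `deadline_secs` elapses.  Returns
    the elapsed seconds on success, or None if the node never came up."""
    start = time.monotonic()
    while time.monotonic() - start < deadline_secs:
        if probe():
            return time.monotonic() - start
        time.sleep(interval_secs)
    return None
```

A node that first answers at, say, 180 seconds would then be kept 
rather than deleted, replaced, and billed for.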

The failed nodes appear to be deleted after sufficient nodes are started 
up, not when they are determined to be failed.  Looking at billing 
records, I noticed that I am being charged for these failed nodes, so I 
think this is an important bug.
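Since EC2 bills per started instance, terminating a node the moment it 
is flagged as failed would at least cap the charge, instead of letting 
it run until enough replacements are up.  A hedged sketch of that 
cleanup step (`conn` stands in for a boto-style EC2 connection object; 
the wrapper name is made up for illustration):

```python
def terminate_failed_immediately(conn, failed_instance_ids):
    """Terminate failed nodes as soon as they are flagged, rather than
    at the end of the bootstrap phase.  `conn` is assumed to expose a
    boto-style terminate_instances(instance_ids=[...]) method."""
    if failed_instance_ids:
        conn.terminate_instances(instance_ids=list(failed_instance_ids))
    return list(failed_instance_ids)
```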


-----whirr.log excerpt-------
03:37:19,043 DEBUG [jclouds.compute] << started 
instances([region=us-west-1, name=i-f9914fbe])
03:37:19,133 DEBUG [jclouds.compute] << present 
instances([region=us-west-1, name=i-f9914fbe])
03:37:19,332 DEBUG [jclouds.compute] << started 
instances([region=us-west-1, name=i-87914fc0],[region=us-west-1, 
name=i-85914fc2])
03:37:19,495 DEBUG [jclouds.compute] << present 
instances([region=us-west-1, name=i-87914fc0],[region=us-west-1, 
name=i-85914fc2])

03:38:46,153 ERROR [jclouds.compute] << problem applying options to 
node(us-west-1/i-85914fc2)

03:40:14,460 DEBUG [jclouds.compute] << started 
instances([region=us-west-1, name=i-5b8e501c])
03:40:14,547 DEBUG [jclouds.compute] << present 
instances([region=us-west-1, name=i-5b8e501c])

03:41:19,691 ERROR [jclouds.compute] << problem applying options to 
node(us-west-1/i-5b8e501c)

03:41:22,738 DEBUG [jclouds.compute] << started 
instances([region=us-west-1, name=i-078e5040])
03:41:22,831 DEBUG [jclouds.compute] << present 
instances([region=us-west-1, name=i-078e5040])
03:44:34,257 INFO  [org.apache.whirr.actions.BootstrapClusterAction]  
Deleting failed node node us-west-1/i-5b8e501c
03:44:34,259 INFO  [org.apache.whirr.actions.BootstrapClusterAction]  
Deleting failed node node us-west-1/i-85914fc2
03:46:27,948 INFO  [org.apache.whirr.service.FileClusterStateStore] 
(main) Wrote instances file instances

The instances file ends up containing:   i-f9914fbe i-87914fc0 i-078e5040
And not containing: i-5b8e501c  i-85914fc2
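As a cross-check, the failed-node IDs can be pulled straight out of the 
"Deleting failed node" lines and compared against the instances file 
(log lines copied verbatim from the excerpt above):

```python
import re

log = (
    "03:44:34,257 INFO  [org.apache.whirr.actions.BootstrapClusterAction]  "
    "Deleting failed node node us-west-1/i-5b8e501c\n"
    "03:44:34,259 INFO  [org.apache.whirr.actions.BootstrapClusterAction]  "
    "Deleting failed node node us-west-1/i-85914fc2\n"
)

# Extract the instance ID after the region prefix (e.g. "us-west-1/").
failed = set(re.findall(r"Deleting failed node node [\w-]+/(i-[0-9a-f]+)", log))
instances_file = {"i-f9914fbe", "i-87914fc0", "i-078e5040"}

print(sorted(failed))
```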

