whirr-user mailing list archives

From Kiss Tibor <kiss.ti...@gmail.com>
Subject Re: Error When Starting Cluster
Date Fri, 17 Dec 2010 10:42:01 GMT
Yesterday I experienced the same issue starting a 12 dn+tt configuration on
ec2.
Some of the instances started up correctly, but at least one failed
with the same stack trace and the entire cluster became unusable.
I tried to ssh into the instance but I couldn't. I also tried to insert the
key manually using Elasticfox, but that failed too.
I looked through the system startup console output and found nothing
unusual there.
It looks like an Amazon EC2 instance startup issue.

After half an hour of investigation I terminated all of the instances,
because the startup process never got far enough for me to repair the
remaining nodes manually.
Then I initiated a new cluster startup with the same parameters (I use
CDH3b3, except for whirr, which is 0.3.0 from the trunk (rev nr: 1044476)).
The second time I had problems with two instances, but of a different kind.
When I see the stack trace we are discussing, I know that there is no
chance to easily repair the cluster, or at least I cannot do it. This time,
however, two nodes simply didn't run the
/tmp/compuserv/compuserv.sh script, so I ran it manually and could then use
my cluster with 12 m1.xlarge nodes.

The cluster startup process on Amazon EC2 instances is actually very
unstable. As the number of nodes to be started increases, the startup
process fails more often. But sometimes even a 2-3 node startup fails. We
don't know how many instance startups are going on at the same time on the
Amazon side when it fails or when it succeeds. The only thing I see is that
when I start around 10 nodes, the failure rate is higher than with a
smaller number of nodes, and it is not directly proportional to the number
of nodes: the probability that some nodes fail looks like it grows
exponentially.
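
To put a rough number on this: as long as one bad node makes the whole
cluster unusable, the chance of a failed startup is the chance that at least
one node fails. Below is a minimal sketch, assuming every node fails
independently with the same probability. The function name and the 5% value
are illustrative assumptions, not measurements from EC2.

```python
def cluster_failure_probability(p_node: float, n_nodes: int) -> float:
    """Probability that at least one of n_nodes fails to start.

    Assumes (hypothetically) that node failures are independent and that
    every node fails with the same probability p_node.
    """
    if not 0.0 <= p_node <= 1.0:
        raise ValueError("p_node must be between 0 and 1")
    if n_nodes < 0:
        raise ValueError("n_nodes must not be negative")
    # All nodes succeed with probability (1 - p_node) ** n_nodes;
    # the cluster fails in every other case.
    return 1.0 - (1.0 - p_node) ** n_nodes


if __name__ == "__main__":
    # 0.05 is a made-up per-node failure rate used only for illustration.
    for n in (3, 10, 12):
        print(n, round(cluster_failure_probability(0.05, n), 3))
```

Under this independence assumption the cluster failure probability never
exceeds n times the per-node probability. If failures really grow faster
than that, it would suggest they are not independent, for example because
something changes when many instances start at once, but that is only a
guess.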

Maybe the main cause is some weird EC2 phenomenon, but the current
coordination in whirr handles these problems rather poorly.
There are two things we can do.
1. Investigate what exactly is behind the instability. Is it really an EC2
issue only?
2. Redesign the startup coordination process so that it can isolate and
repair or evict failing nodes, and then allow adding new nodes later, or
at least allow configuring a longer startup process. Otherwise we are
not able to start large clusters on ec2.
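
The second point could look roughly like the sketch below. It is not how
whirr or jclouds work today and uses none of their APIs: `start_cluster`,
`StartupResult`, `start_node`, `max_attempts` and `min_started` are all
hypothetical names. The idea is to retry each failing node a few times,
evict the ones that never come up, and accept the cluster if enough nodes
started.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class StartupResult:
    started: List[str] = field(default_factory=list)  # nodes that came up
    evicted: List[str] = field(default_factory=list)  # nodes given up on


def start_cluster(
    node_ids: Sequence[str],
    start_node: Callable[[str], bool],  # hypothetical: True if node came up
    max_attempts: int = 3,
    min_started: Optional[int] = None,
) -> StartupResult:
    """Start nodes one by one, retrying failures and evicting bad nodes."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    result = StartupResult()
    for node_id in node_ids:
        ok = False
        for _ in range(max_attempts):
            try:
                ok = start_node(node_id)
            except Exception:
                # Treat any error (e.g. an SSH failure) as a failed attempt.
                ok = False
            if ok:
                break
        if ok:
            result.started.append(node_id)
        else:
            # Isolate the node instead of failing the whole cluster.
            result.evicted.append(node_id)
    # By default every node is required, which mirrors the current behaviour.
    required = len(node_ids) if min_started is None else min_started
    if len(result.started) < required:
        raise RuntimeError(
            f"only {len(result.started)} of {len(node_ids)} nodes started, "
            f"{required} required"
        )
    return result
```

With `min_started` below the node count, one bad instance would no longer
make the whole cluster unusable, and the evicted nodes could be replaced
later.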

Tibor

On Thu, Dec 16, 2010 at 12:45 AM, Tom White <tom@cloudera.com> wrote:

> This looks like a keypair problem. Can you SSH into the instance with
> the private key?
>
> Also, which version are you using?
>
> It's not currently possible to have different sizes for master and
> worker, but t1.micro is too small to run Hadoop so you'd be best using
> a larger instance for the worker nodes. See
> https://issues.apache.org/jira/browse/WHIRR-148 for some guidance on
> the settings for this.
>
> Tom
>
> On Wed, Dec 15, 2010 at 2:26 PM, Shaun Martinec <smartinec@gmail.com>
> wrote:
> > I'm attempting to start a cluster with 1 master and 2 slave nodes. The
> > master node gets created, but then there are some java errors that get
> > thrown and it stops getting created. Any help is appreciated. Also,
> > I'm using t1.micro for slave nodes, but I would like to use a larger
> > node for the master - is this possible?
> >
> > -Shaun
> >
> > Here is my config:
> >
> > whirr.service-name=hadoop
> > whirr.cluster-name=tsmcluster
> > whirr.instance-templates=1 jt+nn,2 dn+tt
> > whirr.provider=ec2
> > whirr.identity=XXXXX
> > whirr.credential=XXXXX
> > whirr.private-key-file=${sys:user.home}/keys/admin-dev.pem
> > whirr.public-key-file=${sys:user.home}/keys/admin-dev.pub
> > whirr.hadoop-install-runurl=cloudera/cdh/install
> > whirr.hadoop-configure-runurl=cloudera/cdh/post-configure
> > whirr.hardware-id=t1.micro
> > whirr.location-id=us-east-1
> >
> > Here is my whirr.log
> >
> > 2010-12-15 16:13:03,359 INFO
> > [org.apache.whirr.service.hadoop.HadoopService] (main) Launching
> > tsmcluster cluster
> > 2010-12-15 16:13:06,905 INFO
> > [org.apache.whirr.service.hadoop.HadoopService] (main) Configuring
> > template
> > 2010-12-15 16:16:05,527 INFO
> > [org.apache.whirr.service.hadoop.HadoopService] (main) Launching
> > tsmcluster cluster
> > 2010-12-15 16:16:09,496 INFO
> > [org.apache.whirr.service.hadoop.HadoopService] (main) Configuring
> > template
> > 2010-12-15 16:16:12,174 DEBUG [jclouds.compute] (main) >> searching
> > params([biggest=false, fastest=false, imageName=null,
> > imageDescription=null, imageId=null, imageVersion=null,
> > location=[id=us-east-1, scope=REGION, description=us-east-1,
> > parent=ec2], minCores=0.0, minRam=0, osFamily=amzn-linux, osName=null,
> > osDescription=null, osVersion=null, osArch=null, os64Bit=null,
> > hardwareId=t1.micro])
> > 2010-12-15 16:16:12,179 DEBUG [jclouds.compute] (main) >> providing
> images
> > 2010-12-15 16:16:15,969 DEBUG [jclouds.compute] (main) << didn't match
> > at all(rightscale_images/CentOS5V1_10.img.manifest.xml)
> > 2010-12-15 16:16:15,970 DEBUG [jclouds.compute] (main) << didn't match
> > at all(rightscale_images/CentOS5_0V4_0_1.img.manifest.xml)
> > 2010-12-15 16:16:16,017 DEBUG [jclouds.compute] (main) << didn't match
> > at all(rightscale_images/CentOS5_2V4_0_2_Beta.manifest.xml)
> > 2010-12-15 16:16:16,018 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-testing-us/karmic-i386-20090926.manifest.xml)
> > 2010-12-15 16:16:16,043 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-testing-us/karmic-x86_64-20090926.manifest.xml)
> > 2010-12-15 16:16:16,048 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-alphas-us/karmic-x86_64-alpha6.manifest.xml)
> > 2010-12-15 16:16:16,067 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-testing-us/karmic-x86_64-20090926.1.manifest.xml)
> > 2010-12-15 16:16:16,070 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-testing-us/karmic-i386-20090926.1.manifest.xml)
> > 2010-12-15 16:16:16,098 DEBUG [jclouds.compute] (main) << didn't match
> > at all(rightscale-us/Ubuntu8.04_V4_3_4_Alpha.manifest.xml)
> > 2010-12-15 16:16:16,104 DEBUG [jclouds.compute] (main) << didn't match
> > at all(ubuntu-alphas-us/karmic-i386-alpha5.manifest.xml)
> > 2010-12-15 16:16:16,121 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-alphas-us/karmic-x86_64-alpha4.manifest.xml)
> > 2010-12-15 16:16:16,154 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-cloud-us/karmic-i386-beta.manifest.xml)
> > 2010-12-15 16:16:16,180 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-alphas-us/karmic-i386-alpha4.manifest.xml)
> > 2010-12-15 16:16:16,181 DEBUG [jclouds.compute] (main) << didn't match
> > at all(rightscale_images/CentOS5_0V2_0_1.img.manifest.xml)
> > 2010-12-15 16:16:16,183 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-cloud-us/karmic-x86_64-beta.manifest.xml)
> > 2010-12-15 16:16:16,232 DEBUG [jclouds.compute] (main) << didn't match
> > at all(099720109477/DB2 Express-C 9.7.1 on Ubuntu 9.10 i386 -
> > 20100407)
> > 2010-12-15 16:16:16,267 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-testing-us/karmic-i386-20090929.manifest.xml)
> > 2010-12-15 16:16:16,282 DEBUG [jclouds.compute] (main) << didn't match
> > at all(ubuntu-alphas-us/karmic-x86_64-alpha5.manifest.xml)
> > 2010-12-15 16:16:16,283 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-testing-us/karmic-x86_64-20090929.manifest.xml)
> > 2010-12-15 16:16:16,289 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-alphas-us/karmic-x86_64-alpha5.1.manifest.xml)
> > 2010-12-15 16:16:16,312 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-alphas-us/karmic-i386-alpha5.1.manifest.xml)
> > 2010-12-15 16:16:16,330 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-cloud-test/ubuntu-jaunty-test.manifest.xml)
> > 2010-12-15 16:16:16,339 DEBUG [jclouds.compute] (main) << didn't match
> > at all(099720109477/DB2 Express-C 9.7.1 on Ubuntu 10.04 LTS i386 -
> > 20100520)
> > 2010-12-15 16:16:16,369 DEBUG [jclouds.compute] (main) << didn't match
> > at all(rightscale-us/CentOS5_2V4_1_10.manifest.xml)
> > 2010-12-15 16:16:16,404 DEBUG [jclouds.compute] (main) << didn't match
> > at all(rightscale_images/CentOS5_0V3_0_0.img.manifest.xml)
> > 2010-12-15 16:16:16,409 DEBUG [jclouds.compute] (main) << didn't match
> > at all(411009282317/Windows_2003_i386_Bare)
> > 2010-12-15 16:16:16,479 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-alphas-us/karmic-i386-alpha6.manifest.xml)
> > 2010-12-15 16:16:16,727 DEBUG [jclouds.compute] (main) << didn't match
> > at all(099720109477/DB2 Express-C 9.7.1 on Ubuntu 10.04 LTS i386 -
> > 20100520)
> > 2010-12-15 16:16:16,782 DEBUG [jclouds.compute] (main) << didn't match
> > at all(099720109477/DB2 Express-C 9.7.1 on Ubuntu 9.10 i386 -
> > 20100407)
> > 2010-12-15 16:16:16,989 DEBUG [jclouds.compute] (main) << didn't match
> > at all(099720109477/DB2 Express-C 9.7.1 on Ubuntu 10.04 LTS i386 -
> > 20100520)
> > 2010-12-15 16:16:17,098 DEBUG [jclouds.compute] (main) << didn't match
> > at all(amazon/EC2 CentOS 5.4 HVM AMI)
> > 2010-12-15 16:16:17,225 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-alphas-eu/karmic-x86_64-alpha4.manifest.xml)
> > 2010-12-15 16:16:17,265 DEBUG [jclouds.compute] (main) << didn't match
> > at all(rightscale-eu/CentOS5_0V4_0_1.img.manifest.xml)
> > 2010-12-15 16:16:17,276 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-alphas-eu/karmic-x86_64-alpha6.manifest.xml)
> > 2010-12-15 16:16:17,278 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-cloud-eu/karmic-i386-beta.manifest.xml)
> > 2010-12-15 16:16:17,279 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-cloud-eu/karmic-x86_64-beta.manifest.xml)
> > 2010-12-15 16:16:17,280 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-alphas-eu/karmic-i386-alpha6.manifest.xml)
> > 2010-12-15 16:16:17,311 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-alphas-eu/karmic-i386-alpha4.manifest.xml)
> > 2010-12-15 16:16:17,332 DEBUG [jclouds.compute] (main) << didn't match
> > at all(rightscale-eu/CentOS5_2V4_1_10.manifest.xml)
> > 2010-12-15 16:16:17,359 DEBUG [jclouds.compute] (main) << didn't match
> > at all(099720109477/DB2 Express-C 9.7.1 on Ubuntu 9.10 i386 -
> > 20100407)
> > 2010-12-15 16:16:17,362 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-alphas-eu/karmic-i386-alpha5.manifest.xml)
> > 2010-12-15 16:16:17,406 DEBUG [jclouds.compute] (main) << didn't match
> > at all(canonical-alphas-eu/karmic-x86_64-alpha5.manifest.xml)
> > 2010-12-15 16:16:17,412 DEBUG [jclouds.compute] (main) << didn't match
> > at all(rightscale-eu/CentOS5_0V3_0_0.img.manifest.xml)
> > 2010-12-15 16:16:17,516 DEBUG [jclouds.compute] (main) << didn't match
> > at all(099720109477/DB2 Express-C 9.7.1 on Ubuntu 10.04 LTS i386 -
> > 20100520)
> > 2010-12-15 16:16:17,547 DEBUG [jclouds.compute] (main) << images(2088)
> > 2010-12-15 16:16:17,603 DEBUG [jclouds.compute] (main) <<   matched
> > hardware([id=t1.micro, providerId=t1.micro, name=t1.micro,
> > processors=[[cores=1.0, speed=1.0]], ram=630, volumes=[],
> > supportsImage=hasRootDeviceType(ebs)])
> > 2010-12-15 16:16:17,616 DEBUG [jclouds.compute] (main) <<   matched
> > image([id=us-east-1/ami-74f0061d, name=null,
> > operatingSystem=[name=null, family=amzn-linux, version=2010.11.1-beta,
> > arch=paravirtual, is64Bit=true,
> > description=137112412989/amzn-ami-2010.11.1-beta.x86_64-ebs],
> > description=Amazon, version=2010.11.1-beta, location=[id=us-east-1,
> > scope=REGION, description=us-east-1, parent=ec2]])
> > 2010-12-15 16:16:17,616 INFO
> > [org.apache.whirr.service.hadoop.HadoopService] (main) Starting master
> > node
> > 2010-12-15 16:16:17,616 DEBUG [jclouds.compute] (main) >> running 1
> > node tag(tsmcluster) location(us-east-1) image(us-east-1/ami-74f0061d)
> > hardwareProfile(t1.micro) options([groupIds=[], keyPair=null,
> > noKeyPair=false, placementGroup=null, noPlacementGroup=false,
> > monitoringEnabled=false, inboundPorts=[22], privateKey=true,
> > publicKey=true, runScript=true, port:seconds=-1:-1, subnetId=null,
> > metadata/details: false])
> > 2010-12-15 16:16:17,617 DEBUG [jclouds.compute] (main) >> creating
> > keyPair region(us-east-1) tag(tsmcluster)
> > 2010-12-15 16:16:20,936 DEBUG [jclouds.compute] (main) << created
> > keyPair(jclouds#tsmcluster#us-east-1#46)
> > 2010-12-15 16:16:20,936 DEBUG [jclouds.compute] (main) >> creating
> > securityGroup region(us-east-1) name(jclouds#tsmcluster#us-east-1)
> > 2010-12-15 16:16:21,138 DEBUG [jclouds.compute] (main) << created
> > securityGroup(jclouds#tsmcluster#us-east-1)
> > 2010-12-15 16:16:21,138 DEBUG [jclouds.compute] (main) >> authorizing
> > securityGroup region(us-east-1) name(jclouds#tsmcluster#us-east-1)
> > port(22)
> > 2010-12-15 16:16:21,211 DEBUG [jclouds.compute] (main) << authorized
> > securityGroup(jclouds#tsmcluster#us-east-1)
> > 2010-12-15 16:16:21,211 DEBUG [jclouds.compute] (main) >> authorizing
> > securityGroup region(us-east-1) name(jclouds#tsmcluster#us-east-1)
> > permission to itself
> > 2010-12-15 16:16:21,336 DEBUG [jclouds.compute] (main) << authorized
> > securityGroup(jclouds#tsmcluster#us-east-1)
> > 2010-12-15 16:16:21,336 DEBUG [jclouds.compute] (main) >> running 1
> > instance region(us-east-1) zone(null) ami(ami-74f0061d)
> > params({InstanceType=[t1.micro], AdditionalInfo=[tsmcluster],
> > SecurityGroup.1=[jclouds#tsmcluster#us-east-1],
> > KeyName=[jclouds#tsmcluster#us-east-1#46]})
> > 2010-12-15 16:16:21,663 DEBUG [jclouds.compute] (main) << started
> > instances(i-3895ad55)
> > 2010-12-15 16:16:21,777 DEBUG [jclouds.compute] (main) << present
> > instances(i-3895ad55)
> > 2010-12-15 16:20:40,096 DEBUG [jclouds.compute] (user thread 7) >>
> > authorizing rsa public key for ec2-user@184.73.112.206
> > 2010-12-15 16:20:40,405 DEBUG [jclouds.compute] (user thread 6) <<
> > initialized(0)
> > 2010-12-15 16:20:40,405 DEBUG [jclouds.compute] (user thread 6) >>
> > running [sudo ./runscript start] as ec2-user@184.73.112.206
> > 2010-12-15 16:20:40,454 DEBUG [jclouds.compute] (user thread 7) <<
> complete(0)
> > 2010-12-15 16:20:43,129 DEBUG [jclouds.compute] (user thread 6) <<
> start(0)
> > 2010-12-15 16:23:56,322 ERROR [jclouds.compute] (user thread 6) ssh,
> > completed: 1/2, errors: 1, rate: 105689ms/op
> > java.util.concurrent.ExecutionException: org.jclouds.ssh.SshException:
> > ec2-user@184.73.112.206:22: Error executing command: ./runscript
> > status
> >        at
> java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
> >        at java.util.concurrent.FutureTask.get(FutureTask.java:83)
> >        at
> org.jclouds.concurrent.FutureIterables$1.run(FutureIterables.java:121)
> >        at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >        at java.lang.Thread.run(Thread.java:619)
> > Caused by: org.jclouds.ssh.SshException: ec2-user@184.73.112.206:22:
> > Error executing command: ./runscript status
> >        at org.jclouds.ssh.jsch.JschSshClient.exec(JschSshClient.java:287)
> >        at
> org.jclouds.compute.predicates.ScriptStatusReturnsZero.refresh(ScriptStatusReturnsZero.java:57)
> >        at
> org.jclouds.compute.predicates.ScriptStatusReturnsZero.apply(ScriptStatusReturnsZero.java:48)
> >        at
> org.jclouds.compute.predicates.ScriptStatusReturnsZero.apply(ScriptStatusReturnsZero.java:37)
> >        at
> com.google.common.base.Predicates$NotPredicate.apply(Predicates.java:335)
> >        at
> org.jclouds.predicates.RetryablePredicate.apply(RetryablePredicate.java:64)
> >        at
> org.jclouds.compute.callables.RunScriptOnNode.call(RunScriptOnNode.java:103)
> >        at
> org.jclouds.compute.callables.RunScriptOnNode.call(RunScriptOnNode.java:53)
> >        at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >        ... 3 more
> > Caused by: com.jcraft.jsch.JSchException: channel is not opened.
> >        at com.jcraft.jsch.Channel.connect(Channel.java:188)
> >        at com.jcraft.jsch.Channel.connect(Channel.java:144)
> >        at org.jclouds.ssh.jsch.JschSshClient.exec(JschSshClient.java:275)
> >        ... 12 more
> > 2010-12-15 16:23:56,324 ERROR [jclouds.compute] (user thread 3) ssh,
> > completed: 1/2, errors: 1, rate: 105691ms/op
> > java.lang.RuntimeException: ssh, completed: 1/2, errors: 1, rate:
> 105691ms/op
> >        at
> org.jclouds.concurrent.FutureIterables.awaitCompletion(FutureIterables.java:139)
> >        at
> org.jclouds.compute.util.ComputeUtils.runCallablesUsingSshClient(ComputeUtils.java:245)
> >        at
> org.jclouds.compute.util.ComputeUtils.runTasksUsingSshClient(ComputeUtils.java:214)
> >        at
> org.jclouds.compute.util.ComputeUtils.runCallablesOnNode(ComputeUtils.java:203)
> >        at
> org.jclouds.compute.util.ComputeUtils.runOptionsOnNode(ComputeUtils.java:151)
> >        at
> org.jclouds.compute.util.ComputeUtils$1.call(ComputeUtils.java:116)
> >        at
> org.jclouds.compute.util.ComputeUtils$1.call(ComputeUtils.java:112)
> >        at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >        at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >        at java.lang.Thread.run(Thread.java:619)
> > 2010-12-15 16:23:56,327 ERROR [jclouds.compute] (user thread 3) <<
> > problem applying options to node(us-east-1/i-3895ad55):
> > java.lang.RuntimeException: error invoking callables on nodes:
> > {org.jclouds.compute.callables.RunScriptOnNode@aae8a
> =java.util.concurrent.ExecutionException:
> > org.jclouds.ssh.SshException: ec2-user@184.73.112.206:22: Error
> > executing command: ./runscript status}
> >        at
> org.jclouds.compute.util.ComputeUtils.runCallablesUsingSshClient(ComputeUtils.java:247)
> >        at
> org.jclouds.compute.util.ComputeUtils.runTasksUsingSshClient(ComputeUtils.java:214)
> >        at
> org.jclouds.compute.util.ComputeUtils.runCallablesOnNode(ComputeUtils.java:203)
> >        at
> org.jclouds.compute.util.ComputeUtils.runOptionsOnNode(ComputeUtils.java:151)
> >        at
> org.jclouds.compute.util.ComputeUtils$1.call(ComputeUtils.java:116)
> >        at
> org.jclouds.compute.util.ComputeUtils$1.call(ComputeUtils.java:112)
> >        at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >        at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >        at java.lang.Thread.run(Thread.java:619)
> >
>
