whirr-user mailing list archives

From Andrei Savu <savu.and...@gmail.com>
Subject Re: non-deterministic "Could not get lock /var/lib/dpkg/lock"
Date Fri, 03 Feb 2012 18:27:43 GMT
I have created the following issue for this:
https://issues.apache.org/jira/browse/WHIRR-501

On Fri, Feb 3, 2012 at 8:24 PM, Andrei Savu <savu.andrei@gmail.com> wrote:

> Good catch Karel! I have tried to investigate this in the past but I have
> never considered that it may be a race condition with a cron job (most of
> the synchronisation tests we've added are designed to prove that this is
> not a condition triggered by Whirr).
>
> What if we stop the crond service while running the install/configure
> scripts?
> http://www.cyberciti.biz/faq/howto-linux-unix-start-restart-cron/
>
>
>> In my opinion, as many of the installation/configuration steps as
>> possible should be done using a config management tool (puppet/chef).
>>
>
> Totally agree + we have the needed infrastructure for this.
>
>
>> Once the configuration is published to each node you can trigger
>> puppet/chef as often as you like, and eventually you should reach a
>> good state. Running the complete whirr-generated script(s) multiple
>> times is going to be slower and much more error prone.
>>
>
> + it's hard to make retry-friendly bash scripts.
>
>
>>
>> Regards,
>> Karel
>>
>> On Mon, Oct 3, 2011 at 10:22 PM, Paul Baclace <paul.baclace@gmail.com> wrote:
>> > Two runs of whirr on EC2 yesterday randomly failed to install Hadoop
>> > components.  First it occurred on the master node, but when it occurred in
>> > one slave and not another, I could find the diff of the /tmp/logs/ from
>> > jclouds.  In a third run, everything worked fine.  Same scripts driving
>> > whirr, same AMI, same number of nodes, same region, etc. Snippets of
>> > /tmp/logs/stderr.log shown below indicate that apt-get update had "Could not
>> > get lock /var/lib/dpkg/lock" on one slave, but not another.
>> >
>> > This is a serious reliability issue.  What is non-deterministic here?
>> >
>> > Paul
>> >
>> > ------------ slave 1 -------------------
>> > + register_cloudera_repo
>> > + which dpkg
>> > + cat
>> > + curl -s http://archive.cloudera.com/debian/archive.key
>> > + sudo apt-key add -
>> > + sudo apt-get update
>> > E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
>> > E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?
>> > + which dpkg
>> > + apt-get update
>> > E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
>> > E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?
>> > + apt-get -y install hadoop-0.20
>> >
>> > -------------- slave 2 ---------------
>> > + register_cloudera_repo
>> > + which dpkg
>> > + cat
>> > + curl -s http://archive.cloudera.com/debian/archive.key
>> > + sudo apt-key add -
>> > + sudo apt-get update
>> > + which dpkg
>> > + apt-get update
>> > + apt-get -y install hadoop-0.20
>> > dpkg-preconfigure: unable to re-open stdin:
>> > + cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.dist
>> > + update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.dist 90
>> > + install_cdh_hbase -c aws-ec2 -u http://apache.cs.utah.edu/hbase/hbase-0.90.3/hbase-0.90.3.tar.gz
>> >
>> > -------------
>>
>>
>>
>> --
>> Karel Vervaeke
>> http://outerthought.org/
>> Open Source Content Applications
>> Makers of Kauri, Daisy CMS and Lily
>>
>
>
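As an illustration of the "retry-friendly bash scripts" point in the thread, one possible approach is to wrap the apt-get calls in a bounded retry loop, so that a lock briefly held by another process (such as a cron job) does not fail the whole install. This is only a sketch and not what Whirr does: the retry_cmd name and its arguments are hypothetical, and it assumes apt-get exits non-zero when it cannot take /var/lib/dpkg/lock.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Whirr): run a command, retrying on failure.
# Usage: retry_cmd MAX_ATTEMPTS DELAY_SECONDS command [args...]
retry_cmd() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  # Keep running the command until it succeeds or attempts run out.
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s: $*" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

An install script could then call, for example, `retry_cmd 5 10 sudo apt-get update` in place of the bare `sudo apt-get update` seen in the trace above. The attempt count and delay are arbitrary here; retrying hides the race rather than removing it, so it complements, and does not replace, stopping the competing cron job.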
