whirr-user mailing list archives

From Andrei Savu <savu.and...@gmail.com>
Subject Re: AMIs to use when creating hadoop cluster with whirr
Date Wed, 05 Oct 2011 17:25:08 GMT
From here:
http://developer.yahoo.com/hadoop/tutorial/module7.html

"With multiple racks of servers, RPC timeouts may become more frequent. The
NameNode takes a continual census of DataNodes and their health via
heartbeat messages sent every few seconds. A similar timeout mechanism
exists on the MapReduce side with the JobTracker. With many racks of
machines, they may force one another to timeout because the master node is
not handling them fast enough. The following options increase the number of
threads on the master machine dedicated to handling RPC's from slave nodes:"
(I think this is also true on AWS.)

The proposed solution is:

  <property>
    <name>dfs.namenode.handler.count</name>
    <value>40</value>
  </property>
  <property>
    <name>mapred.job.tracker.handler.count</name>
    <value>40</value>
  </property>

You can do this in Whirr by specifying:

hadoop-dfs.dfs.namenode.handler.count=40
hadoop-mapreduce.mapred.job.tracker.handler.count=40

in the .properties file.
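
For context, here is what a minimal recipe might look like with those overrides in place. The cluster name, instance template, and credential references below are illustrative placeholders, not values from this thread:

whirr.cluster-name=hadoop-cluster
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,5 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}

# raise RPC handler thread counts on the master daemons
hadoop-dfs.dfs.namenode.handler.count=40
hadoop-mapreduce.mapred.job.tracker.handler.count=40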

Let me know if this works for you. We should probably use something like
this by default.

-- Andrei Savu

On Wed, Oct 5, 2011 at 8:15 PM, Andrei Savu <savu.andrei@gmail.com> wrote:

> Looks like a network congestion issue to me. I would try increasing the
> heartbeat timeout, though I don't know offhand how to do that.
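>
> Something like this might be the relevant knob (an untested guess on my
> part, assuming CDH3-era property names, set in hdfs-site.xml):
>
>   <property>
>     <name>heartbeat.recheck.interval</name>
>     <value>600000</value> <!-- ms; give DataNodes longer before being marked dead -->
>   </property>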
>
> Tom, any ideas? Have you seen this before on AWS?
>
> I don't think there is anything wrong with the AMI; I suspect there is
> something wrong with the Hadoop configuration.
>
>
> On Wednesday, October 5, 2011, John Conwell wrote:
>
>> It starts with Hadoop reporting blocks of data being 'lost', then
>> individual data nodes stop responding, the individual data nodes get taken
>> offline, then jobs get killed, then data nodes come back online and the
>> data blocks get replicated back out to the correct replication factor.
>>
>> The end result is that about 80% of the time, my Hadoop jobs get killed
>> because some task fails 3 times in a row, but about an hour after the job
>> gets killed, all data nodes are back online and all data is fully
>> replicated.
>>
>> Before I go rat-holing down "why are my data nodes going down", I want to
>> cover the easy scenarios, like "oh yeah... you're totally misconfigured.
>> You should use the ABC AMI with the Cloudera install and config scripts".
>> Basically, I want to validate whether there are any best practices for
>> setting up a Cloudera distribution of Hadoop on EC2.
>>
>> I know Cloudera has created their own AMIs.  Should I be using them?  Does
>> it matter?
>>
>>
>>
>>> On Wed, Oct 5, 2011 at 9:43 AM, Andrei Savu <savu.andrei@gmail.com> wrote:
>>
>>> What do you mean by failing? Is the Hadoop daemon shutting down or the
>>> machine as a whole?
>>>
>>> On Wednesday, October 5, 2011, John Conwell wrote:
>>>
>>>> I'm having stability issues (data nodes constantly failing under very
>>>> little load) on the Hadoop clusters I'm creating, and I'm trying to figure
>>>> out the best practice for creating the most stable Hadoop environment on
>>>> EC2.
>>>>
>>>> In order to run the CDH install and config scripts, I'm
>>>> setting whirr.hadoop-install-function to install_cdh_hadoop, and
>>>> whirr.hadoop-configure-function to configure_cdh_hadoop.  But I'm using a
>>>> plain-Jane Ubuntu amd64 AMI (ami-da0cf8b3).  Should I also be using the
>>>> Cloudera AMIs as well as the Cloudera install and config scripts?
>>>>
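>>>> For clarity, in the .properties recipe those settings are the two lines:
>>>>
>>>> whirr.hadoop-install-function=install_cdh_hadoop
>>>> whirr.hadoop-configure-function=configure_cdh_hadoop
>>>>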
>>>> Are there any best practices for how to set up a Cloudera distribution of
>>>> Hadoop on EC2?
>>>>
>>>> --
>>>>
>>>> Thanks,
>>>> John C
>>>>
>>>>
>>>
>>> --
>>> -- Andrei Savu / andreisavu.ro
>>>
>>>
>>
>>
>> --
>>
>> Thanks,
>> John C
>>
>>
>
> --
> -- Andrei Savu / andreisavu.ro
>
>
