hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: Regions not getting reassigned if RS is brought down
Date Fri, 15 Jul 2011 20:53:02 GMT
Good on you lads.  Can we get a fix in for 0.90.4?
St.Ack

On Fri, Jul 15, 2011 at 1:02 PM, Shrijeet Paliwal
<shrijeet@rocketfuel.com> wrote:
> So the problem is that if you use any interface other than 'default'
> (literally that keyword), getDefaultHost in DNS.java returns a string with
> a trailing period. It seems to me that the javadoc of reverseDns in
> DNS.java (see below) conflicts with what the function actually does: it
> returns a PTR record while claiming to return a hostname. The PTR record
> always has a period at the end; see:
> http://irbs.net/bog-4.9.5/bog47.html
>
>  /**
>   * Returns the hostname associated with the specified IP address by the
>   * provided nameserver.
>   *
>   * @param hostIp
>   *            The address to reverse lookup
>   * @param ns
>   *            The host name of a reachable DNS server
>   * @return The host name associated with the provided IP
>   * @throws NamingException
>   *             If a NamingException is encountered
>   */
>  public static String reverseDns(InetAddress hostIp, String ns)
>    throws NamingException {
>    //
>    // Builds the reverse IP lookup form
>    // This is formed by reversing the IP numbers and appending in-addr.arpa
>    //
>    String[] parts = hostIp.getHostAddress().split("\\.");
>    String reverseIP = parts[3] + "." + parts[2] + "." + parts[1] + "."
>      + parts[0] + ".in-addr.arpa";
>
>    System.out.println("reverse ip is :" + reverseIP);
>
>    DirContext ictx = new InitialDirContext();
>    Attributes attribute =
>      ictx.getAttributes("dns://"               // Use "dns:///" if the default
>                         + ((ns == null) ? "" : ns) +
>                         // nameserver is to be used
>                         "/" + reverseIP, new String[] { "PTR" });
>    ictx.close();
>
>    return attribute.get("PTR").get().toString();
>  }
>
>
> Related issue (I haven't gone through it completely, but at a glance it
> looks related):
> https://issues.apache.org/jira/browse/HBASE-2599
> Thanks, Karthick, for pointing this out.
>
> A quick fix is to recognize that the default host has a trailing period and
> drop it where we call it here:
>  String machineName = DNS.getDefaultHost(conf.get(
>        "hbase.regionserver.dns.interface", "default"), conf.get(
>        "hbase.regionserver.dns.nameserver", "default"));
>
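A minimal sketch of the workaround described above, assuming a hypothetical helper class HostNames that is not part of HBase or Hadoop. It strips a single trailing period from a host name before the name is used as an identifier. Whether this belongs at the call site or inside DNS.java is left open here.

```java
// Hypothetical helper, not part of HBase: normalizes a host name that may
// have come back from a PTR lookup in fully qualified form ("host.domain.").
final class HostNames {
  private HostNames() {}

  /** Returns the name without a single trailing period, if one is present. */
  static String stripTrailingDot(String host) {
    if (host != null && host.endsWith(".")) {
      return host.substring(0, host.length() - 1);
    }
    return host;
  }

  public static void main(String[] args) {
    // A PTR-style name loses its trailing period; a plain name is unchanged.
    System.out.println(stripTrailingDot("server-2.rfiserve.net."));
    System.out.println(stripTrailingDot("server-2.rfiserve.net"));
  }
}
```

With such a helper, the call site could wrap the existing call, for example HostNames.stripTrailingDot(DNS.getDefaultHost(...)).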
> I will open an issue shortly.  Thoughts?
>
> -Shrijeet
> On Fri, Jul 15, 2011 at 10:25 AM, Stack <stack@duboce.net> wrote:
>
>> Thanks for digging in, Shrijeet.  We don't do this name matching well
>> in 0.90.x.  Sorry for the pain caused.  On your observation below about
>> RegionServerTracker, if you figure out an improvement, that'd be great.
>>
>> Thanks,
>> St.Ack
>>
>> On Thu, Jul 14, 2011 at 9:07 PM, Shrijeet Paliwal
>> <shrijeet@rocketfuel.com> wrote:
>> > I have narrowed it down to the following:
>> >
>> >  // Server to handle client requests
>> >    String machineName = DNS.getDefaultHost(conf.get(
>> >        "hbase.regionserver.dns.interface", "default"), conf.get(
>> >        "hbase.regionserver.dns.nameserver", "default"));
>> >
>> > I am not using the default interface for the RS; I have changed it to
>> > 'eth1'. The machineName is getting set to 'server-2.rfiserve.net.'
>> > Notice the extra period at the end.
>> >
>> > Because of the above, there is an inconsistency between the way ZooKeeper
>> > recorded the regionserver address and the way ServerManager has it in its
>> > cached list of online servers.
>> > You will notice the extra dot in the ZooKeeper entry but not in the
>> > ServerManager list.
>> >
>> > [zk: localhost:2181(CONNECTED) 3] ls /hbase/rs
>> > [server-2.domain.net.,60020,1310684522383,server-1.domain.net.,60020,1310680203359]
>> >
>> >
>> > In ServerManager we do the following:
>> >
>> > void recordNewServer(HServerInfo info, boolean useInfoLoad,
>> >      HRegionInterface hri) {
>> >    HServerLoad load = useInfoLoad? info.getLoad(): new HServerLoad();
>> >    String serverName = info.getServerName();
>> >    LOG.info("Registering server=" + serverName + ", regionCount=" +
>> >      load.getLoad() + ", userLoad=" + useInfoLoad);
>> >    info.setLoad(load);
>> >    // TODO: Why did we update the RS location ourself?  Shouldn't RS do this?
>> >    // masterStatus.getZooKeeper().updateRSLocationGetWatch(info, watcher);
>> >    // -- If I understand the question, the RS does not update the location
>> >    // because could be disagreement over locations because of DNS issues;
>> >    // only master does DNS now -- St.Ack 20100929.
>> >    this.onlineServers.put(serverName, info);
>> > ......
>> >
>> > In RegionServerTracker, after node deletion but before server expiration,
>> > a map lookup happens. It looks up
>> > server-2.domain.net.,60020,1310684522383 (with an extra period), but the
>> > actual key in the map is
>> > server-2.domain.net,60020,1310684522383 (without the extra period).
>> >
>> >
>> >  @Override
>> >  public void nodeDeleted(String path) {
>> >    if(path.startsWith(watcher.rsZNode)) {
>> >      String serverName = ZKUtil.getNodeName(path);
>> >      LOG.info("RegionServer ephemeral node deleted, processing expiration [" +
>> >          serverName + "]");
>> >      HServerInfo hsi = serverManager.getServerInfo(serverName);
>> >      if(hsi == null) {
>> >        LOG.info("No HServerInfo found for " + serverName);
>> >        return;
>> >      }
>> >      serverManager.expireServer(hsi);
>> >    }
>> >  }
>> >
>> > The lookup will fail and expiration will never happen. I will get back when
>> > I have more details on why the DNS name is being returned as such.
>> > An interesting question: is it OK not to expire the region server when
>> > we have already deleted the RS's entry from ZooKeeper?
>> >
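The lookup miss described above can be reproduced outside HBase. The sketch below is a stand-alone illustration: OnlineServersDemo and its methods are hypothetical and only mimic a map keyed by "host,port,startcode" strings, as ServerManager's onlineServers appears to be in the snippet above. Normalizing at lookup time is just one possible remedy, shown here as an assumption; the thread itself leans toward dropping the period at the source.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a map of online servers keyed by
// "host,port,startcode"; not HBase code.
final class OnlineServersDemo {
  private final Map<String, String> onlineServers = new HashMap<>();

  /** Drops a trailing period from the host part of "host,port,startcode". */
  static String normalize(String serverName) {
    int comma = serverName.indexOf(',');
    String host = comma < 0 ? serverName : serverName.substring(0, comma);
    String rest = comma < 0 ? "" : serverName.substring(comma);
    if (host.endsWith(".")) {
      host = host.substring(0, host.length() - 1);
    }
    return host + rest;
  }

  void register(String serverName, String info) {
    onlineServers.put(serverName, info);
  }

  /** Exact-match lookup, as in the failing case: returns null on a miss. */
  String lookupExact(String serverName) {
    return onlineServers.get(serverName);
  }

  /** Lookup that normalizes the host part of the key first. */
  String lookupNormalized(String serverName) {
    return onlineServers.get(normalize(serverName));
  }

  public static void main(String[] args) {
    OnlineServersDemo demo = new OnlineServersDemo();
    // The map holds the name without the period ...
    demo.register("server-2.domain.net,60020,1310684522383", "info");
    // ... while the znode name carries the extra period.
    String fromZk = "server-2.domain.net.,60020,1310684522383";
    System.out.println(demo.lookupExact(fromZk));      // miss
    System.out.println(demo.lookupNormalized(fromZk)); // hit
  }
}
```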
>> > On Thu, Jul 14, 2011 at 4:32 PM, Shrijeet Paliwal
>> > <shrijeet@rocketfuel.com> wrote:
>> >
>> >> Hi Everyone,
>> >>
>> >> Hbase Version: 0.90.3
>> >> Hadoop Version: cdh3u0
>> >> 2 region servers, zookeeper quorum managed by hbase.
>> >>
>> >> I was doing some tests and it seemed regions are not getting reassigned
>> >> by the master if an RS is brought down.
>> >> Here are the steps:
>> >>
>> >> 0. Cluster in a steady state. Pick a random key, k1, belonging to an RS,
>> >> rs1, and perform a get from the shell. The result comes back fine.
>> >> 1. Bring down rs1 using [/usr/lib/hbase-0.20/bin/hbase-daemon.sh --config
>> >> /usr/lib/hbase-0.20/conf/ stop regionserver]
>> >> 2. Wait a few seconds and do a get from the shell for k1 again. k1 is
>> >> still being located at rs1 and a RetriesExhaustedException occurs.
>> >> 3. Wait a few minutes and do a get from the shell for k1 again. k1 is
>> >> still being located at rs1 and a RetriesExhaustedException occurs.
>> >> 4. Bring up rs1 using [/usr/lib/hbase-0.20/bin/hbase-daemon.sh --config
>> >> /usr/lib/hbase-0.20/conf/ start regionserver]
>> >> 5. A get from the shell brings back the result just fine.
>> >>
>> >> My hope at step (3) was that the regions would be reassigned and the get
>> >> would succeed. 0.90.2 introduced a process to do things more gracefully,
>> >> which is great, but that (graceful shutdown) is not always possible.
>> >> I have pastebin-ed the relevant logs. Can anyone help me understand the
>> >> scenario?
>> >>
>> >> Hbase Shell after RS brought down
>> >> http://pastebin.com/8bvk5RFV
>> >>
>> >> RS log around time it was brought down
>> >> http://pastebin.com/sgVRVCCj
>> >>
>> >> Zkdump after RS brought down
>> >> http://pastebin.com/meyqCVJ0
>> >>
>> >> Hmaster log around time RS was brought down
>> >> http://pastebin.com/jBGKuy74
>> >>
>> >> hbck after RS brought down
>> >> http://pastebin.com/bxvyTTF5
>> >>
>> >> hbck after RS brought up
>> >> http://pastebin.com/FPxvT9qW
>> >>
>> >
>>
>
