hbase-user mailing list archives

From Tatsuya Kawano <tatsuy...@snowcocoa.info>
Subject Re: HBase 0.20.1 Distributed Install Problems
Date Wed, 11 Nov 2009 09:50:10 GMT
Hi Chris, and thanks Lars for help.

OK. So "jstack 22200" shows your region server is still trying to finish
starting up, but it is stuck partway through, while trying to get the IP
address of the master from ZooKeeper.

===========================================================
"main" prio=10 tid=0x0805a800 nid=0x56d2 waiting on condition [0xb72f2000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:74)
    at org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:51)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.watchMasterAddress(HRegionServer.java:387)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.reinitializeZooKeeper(HRegionServer.java:315)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.reinitialize(HRegionServer.java:306)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:276)
===========================================================
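
If you want to pull just this thread out of a full dump on the other
boxes, something like the following may help. It is only a minimal
sketch: main_thread_stack is a made-up helper name, and it assumes
jstack separates threads with blank lines, as in the dump you sent.

```shell
# Print only the block of the thread named "main" from a jstack dump on stdin.
# main_thread_stack is a hypothetical helper, not part of HBase or the JDK.
main_thread_stack() {
  awk '
    /^"main" / { printing = 1 }              # header line of the "main" thread
    printing && /^[[:space:]]*$/ { exit }    # a blank line ends the thread block
    printing { print }
  '
}

# Usage (22200 is the HRegionServer pid reported by jps on chanel):
#   jstack 22200 | main_thread_stack
```

The pattern includes the closing quote and a space, so it should not
match "main-EventThread" or "main-SendThread".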

I still need to see the regionserver logs to figure out why this is happening.


Also,

> This is what I see when I run start-hbase.sh -- I can ssh into any of the
> boxes with no password just fine, it just gives me a weird first time host
> message...we get the same thing when we start up hadoop.
...
> crunch2: regionserver running as process 6950. Stop it first.
> chanel: regionserver running as process 22200. Stop it first.
> crunch3: regionserver running as process 28962. Stop it first.
> chris: regionserver running as process 28719. Stop it first.

This "Stop it first" message means your region servers didn't stop
when you ran stop-hbase.sh. The master couldn't locate those region
servers, so it couldn't tell them to shut down. This is why you've got
those orphaned region servers. So until we finish setting your HBase
cluster up, you'll have to stop those region servers by hand.

To do this, ssh to M2 -- M5, and type the following command:
${HBASE_HOME}/bin/hbase-daemon.sh stop regionserver

Then run jps again to make sure HRegionServer is no longer running. If
the above command doesn't work, you can use the Unix "kill" command.
Then ssh to M1 and run stop-hbase.sh to stop the master and ZooKeepers.
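
The per-box part of this could be sketched roughly as below. The
function names and the 5-second wait are my own assumptions, not
anything shipped with HBase; only hbase-daemon.sh, jps and kill come
from the steps above.

```shell
# Print the pid of HRegionServer from jps output on stdin (nothing if absent).
# regionserver_pid is a hypothetical helper name.
regionserver_pid() {
  awk '$2 == "HRegionServer" { print $1 }'
}

# Stop the local region server, falling back to kill if it is still alive.
# stop_regionserver is a hypothetical helper name.
stop_regionserver() {
  "${HBASE_HOME}/bin/hbase-daemon.sh" stop regionserver
  sleep 5   # arbitrary grace period (an assumption) before checking again
  pid="$(jps | regionserver_pid)"
  if [ -n "$pid" ]; then
    echo "HRegionServer still running as process $pid, sending TERM"
    kill "$pid"
  fi
}

# Usage, on each of M2 -- M5:
#   stop_regionserver
```

With the jps output you posted from chanel, regionserver_pid would
print 22200.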


It's still a mystery why you don't have regionserver logs while you do
have zookeeper logs. Maybe those orphaned region servers were the reason?
I don't know, but you can give it another try after stopping them. So,
try stopping all the HBase / ZooKeeper processes as described above, then
run start-hbase.sh once again. If you can get the regionserver log to us,
that would be great.
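
To check quickly whether a box has any regionserver log after the
restart, a small helper along these lines might do. It is a sketch
only: list_regionserver_logs is a made-up name, and the
"-regionserver-" file name pattern is my guess by analogy with the
hbase-hadoop-zookeeper-chanel.log files in your listing.

```shell
# List region server log files in a logs directory (default: ${HBASE_HOME}/logs).
# Assumes the file names contain "-regionserver-"; this is an assumption
# based on the zookeeper log names shown in this thread.
list_regionserver_logs() {
  dir="${1:-${HBASE_HOME}/logs}"
  for f in "$dir"/*-regionserver-*; do
    # When nothing matches, the glob stays literal, so test for existence.
    [ -e "$f" ] && echo "$f"
  done
  return 0
}

# Usage, on each of M2 -- M5:
#   list_regionserver_logs
```

If it prints nothing, the region server still isn't writing a log on
that box.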

Thanks,

-- 
Tatsuya Kawano (Mr.)
Tokyo, Japan




On Wed, Nov 11, 2009 at 5:52 PM, Chris Bates
<christopher.andrew.bates@gmail.com> wrote:
> Hi Lars,
>
> By no logs I mean that when I ssh into any of the M2-M5 boxes and check the
> logs folder, there is only zookeeper logs, no RS logs (see below).  The
> permissions are ok.
>
> This is what I see when I run start-hbase.sh -- I can ssh into any of the
> boxes with no password just fine, it just gives me a weird first time host
> message...we get the same thing when we start up hadoop.
>
> hadoop@chanel2:/opt/hadoop/hbase-0.20.1$ bin/start-hbase.sh
> crunch2: Warning: Permanently added '[crunch2]:2200,[172.16.1.95]:2200'
> (RSA) to the list of known hosts.
> chanel: Warning: Permanently added '[chanel]:2200,[172.16.1.45]:2200' (RSA)
> to the list of known hosts.
> chanel2: Warning: Permanently added '[chanel2]:2200,[172.16.1.46]:2200'
> (RSA) to the list of known hosts.
> chris: Warning: Permanently added '[chris]:2200,[172.16.1.83]:2200' (RSA) to
> the list of known hosts.
> crunch3: Warning: Permanently added '[crunch3]:2200,[172.16.1.96]:2200'
> (RSA) to the list of known hosts.
> chanel: starting zookeeper, logging to
> /opt/hadoop/hbase-0.20.1/bin/../logs/hbase-hadoop-zookeeper-chanel.out
> chanel2: starting zookeeper, logging to
> /opt/hadoop/hbase-0.20.1/bin/../logs/hbase-hadoop-zookeeper-chanel2.out
> chris: starting zookeeper, logging to
> /opt/hadoop/hbase-0.20.1/bin/../logs/hbase-hadoop-zookeeper-chris.out
> crunch2: starting zookeeper, logging to
> /opt/hadoop/hbase-0.20.1/bin/../logs/hbase-hadoop-zookeeper-crunch2.out
> crunch3: starting zookeeper, logging to
> /opt/hadoop/hbase-0.20.1/bin/../logs/hbase-hadoop-zookeeper-crunch3.out
> starting master, logging to
> /opt/hadoop/hbase-0.20.1/bin/../logs/hbase-hadoop-master-chanel2.out
> crunch2: Warning: Permanently added '[crunch2]:2200,[172.16.1.95]:2200'
> (RSA) to the list of known hosts.
> crunch3: Warning: Permanently added '[crunch3]:2200,[172.16.1.96]:2200'
> (RSA) to the list of known hosts.
> chanel: Warning: Permanently added '[chanel]:2200,[172.16.1.45]:2200' (RSA)
> to the list of known hosts.
> chris: Warning: Permanently added '[chris]:2200,[172.16.1.83]:2200' (RSA) to
> the list of known hosts.
> crunch2: regionserver running as process 6950. Stop it first.
> chanel: regionserver running as process 22200. Stop it first.
> crunch3: regionserver running as process 28962. Stop it first.
> chris: regionserver running as process 28719. Stop it first.
>
>
> Here is the jstack from one of the boxes:
>
> hadoop@chanel:/opt/hadoop/hbase-0.20.1$ jps
> 23777 TaskTracker
> 30449 Jps
> 23694 DataNode
> 26747 Main
> 22200 HRegionServer
> 30174 HQuorumPeer
>
> hadoop@chanel:/opt/hadoop/hbase-0.20.1$ jstack 22200
> 2009-11-11 03:43:56
> Full thread dump Java HotSpot(TM) Server VM (14.2-b01 mixed mode):
>
> "Attach Listener" daemon prio=10 tid=0x083f8000 nid=0x7709 waiting on
> condition [0x00000000]
>   java.lang.Thread.State: RUNNABLE
>
> "main-EventThread" daemon prio=10 tid=0x6e586400 nid=0x56e3 waiting on
> condition [0x6e4ad000]
>   java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x73865330> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>  at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:414)
>
> "main-SendThread" daemon prio=10 tid=0x6e572400 nid=0x56e2 waiting on
> condition [0x6e4fe000]
>   java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
>  at
> org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:851)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:895)
>
> "Low Memory Detector" daemon prio=10 tid=0x0813ac00 nid=0x56dd runnable
> [0x00000000]
>   java.lang.Thread.State: RUNNABLE
>
> "CompilerThread1" daemon prio=10 tid=0x08139000 nid=0x56dc waiting on
> condition [0x00000000]
>   java.lang.Thread.State: RUNNABLE
>
> "CompilerThread0" daemon prio=10 tid=0x08136400 nid=0x56db waiting on
> condition [0x00000000]
>   java.lang.Thread.State: RUNNABLE
>
> "Signal Dispatcher" daemon prio=10 tid=0x08134c00 nid=0x56da runnable
> [0x00000000]
>   java.lang.Thread.State: RUNNABLE
>
> "Surrogate Locker Thread (CMS)" daemon prio=10 tid=0x08133400 nid=0x56d9
> waiting on condition [0x00000000]
>   java.lang.Thread.State: RUNNABLE
>
> "Finalizer" daemon prio=10 tid=0x0811f800 nid=0x56d8 in Object.wait()
> [0x6ec75000]
>   java.lang.Thread.State: WAITING (on object monitor)
>  at java.lang.Object.wait(Native Method)
> - waiting on <0x73860458> (a java.lang.ref.ReferenceQueue$Lock)
>  at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
> - locked <0x73860458> (a java.lang.ref.ReferenceQueue$Lock)
>  at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
> at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>
> "Reference Handler" daemon prio=10 tid=0x0811e400 nid=0x56d7 in
> Object.wait() [0x6ecc6000]
>   java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
>  - waiting on <0x738657e0> (a java.lang.ref.Reference$Lock)
> at java.lang.Object.wait(Object.java:485)
>  at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
> - locked <0x738657e0> (a java.lang.ref.Reference$Lock)
>
> "main" prio=10 tid=0x0805a800 nid=0x56d2 waiting on condition [0xb72f2000]
>   java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
>  at org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:74)
> at org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:51)
>  at
> org.apache.hadoop.hbase.regionserver.HRegionServer.watchMasterAddress(HRegionServer.java:387)
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.reinitializeZooKeeper(HRegionServer.java:315)
>  at
> org.apache.hadoop.hbase.regionserver.HRegionServer.reinitialize(HRegionServer.java:306)
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:276)
>  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>  at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>  at
> org.apache.hadoop.hbase.regionserver.HRegionServer.doMain(HRegionServer.java:2472)
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2540)
>
> "VM Thread" prio=10 tid=0x0811a400 nid=0x56d6 runnable
>
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x0805e400 nid=0x56d3
> runnable
>
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x0805fc00 nid=0x56d4
> runnable
>
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x080cd800 nid=0x56d5
> runnable
> "VM Periodic Task Thread" prio=10 tid=0x0813cc00 nid=0x56de waiting on
> condition
>
> JNI global references: 691
>
> hadoop@chanel:/opt/hadoop/hbase-0.20.1$ ls -l
> total 3628
> drwxr-xr-x 2 hadoop hadoop    4096 2009-11-10 21:41 bin
> -rw-r--r-- 1 hadoop hadoop   21416 2009-11-10 21:41 build.xml
> -rw-r--r-- 1 hadoop hadoop  115584 2009-11-10 21:41 CHANGES.txt
> drwxr-xr-x 2 hadoop hadoop    4096 2009-11-11 02:00 conf
> drwxr-xr-x 4 hadoop hadoop    4096 2009-11-10 21:41 contrib
> drwxr-xr-x 5 hadoop hadoop    4096 2009-11-10 21:41 docs
> -rw-r--r-- 1 hadoop hadoop 1544829 2009-11-10 21:41 hbase-0.20.1.jar
> -rw-r--r-- 1 hadoop hadoop 1954331 2009-11-10 21:41 hbase-0.20.1-test.jar
> drwxr-xr-x 4 hadoop hadoop    4096 2009-11-10 21:41 lib
> -rw-r--r-- 1 hadoop hadoop   11358 2009-11-10 21:41 LICENSE.txt
> drwxr-xr-x 2 hadoop hadoop    4096 2009-11-11 03:38 logs
> -rw-r--r-- 1 hadoop hadoop    1741 2009-11-10 21:41 NOTICE.txt
> -rw-r--r-- 1 hadoop hadoop      43 2009-11-10 21:41 README.txt
> drwxr-xr-x 8 hadoop hadoop    4096 2009-11-10 21:41 src
> drwxr-xr-x 6 hadoop hadoop    4096 2009-11-10 21:41 webapps
>
> hadoop@chanel:/opt/hadoop/hbase-0.20.1$ cd logs/
> hadoop@chanel:/opt/hadoop/hbase-0.20.1/logs$ ll
> total 72
> -rw-r--r-- 1 hadoop hadoop 66759 2009-11-11 03:38
> hbase-hadoop-zookeeper-chanel.log
> -rw-r--r-- 1 hadoop hadoop     0 2009-11-11 03:38
> hbase-hadoop-zookeeper-chanel.out
> -rw-r--r-- 1 hadoop hadoop     0 2009-11-11 03:00
> hbase-hadoop-zookeeper-chanel.out.1
> -rw-r--r-- 1 hadoop hadoop     0 2009-11-11 02:56
> hbase-hadoop-zookeeper-chanel.out.2
> -rw-r--r-- 1 hadoop hadoop     0 2009-11-11 02:36
> hbase-hadoop-zookeeper-chanel.out.3
> -rw-r--r-- 1 hadoop hadoop     0 2009-11-11 02:18
> hbase-hadoop-zookeeper-chanel.out.4
>
>
>
>
> On Wed, Nov 11, 2009 at 3:15 AM, Lars George <lars@worldlingo.com> wrote:
>
>> Chris,
>>
>> What do you mean there are no region server logs? On the M2-M5 you have no
>> logs? Is the Java process for the RS running? If so, could you jstck it to
>> see where it hangs?
>>
>> Maybe you have an access/owner issue with the log dirs on the RS machines?
>>
>> The master log looks OK.
>>
>> Lars
>>
>> Chris Bates schrieb:
>>
>>> Again, I really appreciate the help.  I removed the master from the region
>>> server list and made sure the rest of the machines had an updated list.
>>>  No
>>> region servers still:
>>> hbase(main):001:0> zk_dump
>>>
>>> HBase tree in ZooKeeper is rooted at /hbase
>>>  Cluster up? true
>>>  In safe mode? true
>>>  Master address: 172.16.1.46:60000
>>>  Region server holding ROOT: 172.16.1.46:60020
>>>  Region servers:
>>>
>>> hbase(main):002:0> status 'simple'
>>> 0 live servers
>>> 0 dead servers
>>>
>>> I checked the /etc/hosts file on all machines and they all have 127.0.0.1
>>> localhost.localdomain localhost and then their other mappings for other
>>> domains, with the box name mapping was removed.
>>>
>>> There are no regionserver logs.  But the master log is this:
