hbase-user mailing list archives

From Zhenyu Zhong <zhongresea...@gmail.com>
Subject Re: regarding to HBase 1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration)
Date Wed, 28 Oct 2009 19:40:44 GMT
JG,


Thanks a lot for the tips.
I set the heap to 4GB and the GC options to -XX:ParallelGCThreads=8
-XX:+UseConcMarkSweepGC.
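
In hbase-env.sh terms, those settings would look roughly like the sketch
below. It assumes the stock HBASE_HEAPSIZE (in MB) and HBASE_OPTS variables
from conf/hbase-env.sh, and 4096 is an assumed MB value for the 4GB heap:

```shell
# conf/hbase-env.sh (sketch; values mirror the settings described above)
# Maximum heap in MB (assumed: 4096 MB for the 4GB heap mentioned above).
export HBASE_HEAPSIZE=4096
# Extra JVM options for the HBase daemons: CMS collector, 8 parallel GC threads.
export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=8"
```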

I checked the logs on my Master and RS and found the following errors.
Basically, the master got an exception while scanning ROOT, and then the
ROOT region was taken offline and unset. The regionserver, in turn, logs
NotServingRegionException errors.

In the master:
2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scanning meta region {server: x.x.x.x:60021,
regionname: -ROOT-,,0, startKey: <>}
2009-10-28 19:00:30,591 WARN org.apache.hadoop.hbase.master.BaseScanner:
Scan ROOT region
java.io.IOException: Call to /x.x.x.x:60021 failed on local exception:
java.io.EOFException
        at
org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:757)
        at
org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:727)
        at
org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
        at $Proxy1.openScanner(Unknown Source)
        at
org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160)
        at
org.apache.hadoop.hbase.master.RootScanner.scanRoot(RootScanner.java:54)
        at
org.apache.hadoop.hbase.master.RootScanner.maintenanceScan(RootScanner.java:79)
        at
org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136)
        at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:504)
        at
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:448)
2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scanning meta region {server: x.x.x.x:60021,
regionname: .META.,,1, startKey: <>}
2009-10-28 19:00:30,591 WARN org.apache.hadoop.hbase.master.BaseScanner:
Scan one META region: {server: x.x.x.x:60021, regionname: .META.,,1,
startKey: <>}
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
        at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
        at
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308)
        at
org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831)
        at
org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712)
        at
org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
        at $Proxy1.openScanner(Unknown Source)
        at
org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160)
        at
org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
        at
org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
        at
org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136)
        at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
2009-10-28 19:00:30,591 INFO org.apache.hadoop.hbase.master.BaseScanner: All
1 .META. region(s) scanned
2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.ServerManager:
Removing server's info YYYY,60021,1256755470570
2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager:
Offlined ROOT server: x.x.x.x:60021

2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager:
-ROOT- region unset (but not set to be reassigned)
2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager:
ROOT inserted into regionsInTransition
2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager:
Offlining META region: {server: x.x.x.x:60021, regionname: .META.,,1,
startKey: <>}
2009-10-28 19:00:31,395 INFO org.apache.hadoop.hbase.master.RegionManager:
META region removed from onlineMetaRegions



On the regionserver:
2009-10-28 18:51:14,578 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
test,,1256755871065
2009-10-28 18:51:14,578 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN:
test,,1256755871065
2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegion:
region test,,1256755871065/796855017 available; sequence id is 10013291
2009-10-28 18:51:14,578 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Starting compaction on region test,,1256755871065
2009-10-28 18:51:18,388 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
response for sessionid:0x249c76021d0001 after 0ms
2009-10-28 18:51:19,341 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer:
org.apache.hadoop.hbase.NotServingRegionException: test,,1256754924503
        at
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2307)
        at
org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1784)
        at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at
org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
        at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
2009-10-28 18:51:19,341 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 0 on 60021, call get([B@21fefd80, row=1053508149, maxVersions=1,
timeRange=[0,9223372036854775807), families={(family=email_ip_activity,
columns=ALL}) from x.x.x.x:54669: error:
org.apache.hadoop.hbase.NotServingRegionException: test,,1256754924503





On Wed, Oct 28, 2009 at 2:56 PM, Jonathan Gray <jlist@streamy.com> wrote:

> These client error messages are not particularly descriptive as to the root
> cause (they are fatal errors, or close to it).
>
> What is going on in your regionservers when these errors happen?  Check the
> master and RS logs.
>
> Also, you definitely do not want 19 zookeeper nodes.  Reduce that to 3 or 5
> max.
>
> What is the hardware you are using for these nodes, and what settings do
> you have for heap/GC?
>
> JG
>
>
> Zhenyu Zhong wrote:
>
>> Stack,
>>
>> Thank you very much for your comments.
>> I am running a cluster with 20 nodes. I set up 19 of them as both
>> regionservers and ZooKeeper quorum members.
>> The versions I am using are Hadoop 0.20.1 and HBase 0.20.1.
>> I started with an empty table and am trying to load 200 million records
>> into it.
>> Each record contains a key. In my MR program, I open an HTable during
>> setup. In the mapper, I fetch the row from the HTable by the record's
>> key, make some changes to the columns, and write the row back to the
>> HTable through TableOutputFormat by passing a Put. No reduce tasks are
>> involved. (Fetching a row from an empty table is unnecessary, but I did
>> it intentionally.)
>>
>> Additionally, when I reduced the number of regionservers and ZooKeeper
>> quorum members, I got different errors:
>> org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out
>> trying to locate root region
>>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:929)
>>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:580)
>>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562)
>>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693)
>>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:589)
>>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562)
>>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693)
>>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:593)
>>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:556)
>>   at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:127)
>>   at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:105)
>>   at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.getRecordWriter(TableOutputFormat.java:116)
>>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:573)
>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>   at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>
>> Many thanks in advance.
>> zhenyu
>>
>>
>>
>>
>> On Wed, Oct 28, 2009 at 12:39 PM, stack <stack@duboce.net> wrote:
>>
>>> What's your cluster topology?  How many nodes involved?  When you see the
>>> below message, how many regions in your table?  How are you loading your
>>> table?
>>> Thanks,
>>> St.Ack
>>>
>>> On Wed, Oct 28, 2009 at 7:45 AM, Zhenyu Zhong <zhongresearch@gmail.com>
>>> wrote:
>>>
>>>> Nitay,
>>>>
>>>> I really appreciate it.
>>>>
>>>> As Ryan suggested, I increased the ZooKeeper session timeout to 40
>>>> seconds, with the GC options -XX:ParallelGCThreads=8
>>>> -XX:+UseConcMarkSweepGC in place. I set the heap size to 4GB and also
>>>> set vm.swappiness=0.
>>>>
>>>> However, it still ran into problems. Please see the following errors.
>>>>
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
>>>> contact region server x.x.x.x:60021 for region
>>>> YYYY,117.99.7.153,1256396118155, row '1170491458', but failed after 10
>>>> attempts.
>>>> Exceptions:
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>>> setting up proxy to /x.x.x.x:60021 after attempts=1
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>>> setting up proxy to /x.x.x.x:60021 after attempts=1
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>>> setting up proxy to /x.x.x.x:60021 after attempts=1
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>>> setting up proxy to /x.x.x.x:60021 after attempts=1
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>>> setting up proxy to /x.x.x.x:60021 after attempts=1
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>>> setting up proxy to /x.x.x.x:60021 after attempts=1
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>>> setting up proxy to /x.x.x.x:60021 after attempts=1
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>>> setting up proxy to /x.x.x.x:60021 after attempts=1
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>>> setting up proxy to /x.x.x.x:60021 after attempts=1
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>>> setting up proxy to /x.x.x.x:60021 after attempts=1
>>>>
>>>>       at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1001)
>>>>       at org.apache.hadoop.hbase.client.HTable.get(HTable.java:413)
>>>>
>>>>
>>>> The input file is about 10GB, around 200 million rows of data.
>>>> This load doesn't seem too large. However, these kinds of errors keep
>>>> popping up.
>>>>
>>>> Do regionservers need to be deployed on dedicated machines?
>>>> Does ZooKeeper need to be deployed on dedicated machines as well?
>>>>
>>>> Best,
>>>> zhenyu
>>>>
>>>>
>>>>
>>>> On Wed, Oct 28, 2009 at 1:37 AM, nitay <nitayj@gmail.com> wrote:
>>>>
>>>>> Hi Zhenyu,
>>>>>
>>>>> Sorry for the delay. I started working on this a while back, before I
>>>>> left my job for another company. Since then I haven't had much time to
>>>>> work on HBase unfortunately :(. I'll try to dig up what I had and see
>>>>> what shape it's in and update you.
>>>>>
>>>>> Cheers,
>>>>> -n
>>>>>
>>>>>
>>>>> On Oct 27, 2009, at 3:38 PM, Ryan Rawson wrote:
>>>>>
>>>>>> Sorry I must have mistyped, I meant to say "40 seconds".  You can
>>>>>> still see multi-second pauses at times, so you need to give yourself
>>>>>> a bigger buffer.
>>>>>>
>>>>>> The parallel threads argument should not be necessary, but you do
>>>>>> need the UseConcMarkSweepGC flag as well.
>>>>>>
>>>>>> Let us know how it goes!
>>>>>> -ryan
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 27, 2009 at 3:19 PM, Zhenyu Zhong <zhongresearch@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Ryan,
>>>>>>> I really appreciate your feedback.
>>>>>>> I have set the zookeeper.session.timeout to seconds, which is way
>>>>>>> higher than 40ms.
>>>>>>> At the same time, -Xms is set to 4GB, which should be sufficient.
>>>>>>> I also tried GC options like
>>>>>>>
>>>>>>>  -XX:ParallelGCThreads=8
>>>>>>> -XX:+UseConcMarkSweepGC
>>>>>>>
>>>>>>> I even set vm.swappiness=0.
>>>>>>>
>>>>>>> However, I still came across the problem of a RegionServer shutting
>>>>>>> itself down.
>>>>>>>
>>>>>>> Best,
>>>>>>> zhong
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 27, 2009 at 6:05 PM, Ryan Rawson <ryanobjc@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Set the ZK timeout to something like 40ms, and give the GC enough
>>>>>>>> Xmx so you never risk entering the much dreaded
>>>>>>>> concurrent-mode-failure whereby the entire heap must be GCed.
>>>>>>>>
>>>>>>>> Consider testing Java 7 and the G1 GC.
>>>>>>>>
>>>>>>>> We could get a JNI thread to do this, but no one has done so yet. I
>>>>>>>> am personally hoping for G1 and in the meantime overprovision our
>>>>>>>> Xmx to avoid the concurrent mode failures.
>>>>>>>>
>>>>>>>> -ryan
>>>>>>>>
>>>>>>>> On Tue, Oct 27, 2009 at 2:59 PM, Zhenyu Zhong <zhongresearch@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Ryan,
>>>>>>>>>
>>>>>>>>> Thank you very much.
>>>>>>>>> May I ask whether there are any ways to get around this problem to
>>>>>>>>> make HBase more stable?
>>>>>>>>>
>>>>>>>>> best,
>>>>>>>>> zhong
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Oct 27, 2009 at 4:06 PM, Ryan Rawson <ryanobjc@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> There isn't any working code yet. Just an idea, and a prototype.
>>>>>>>>>>
>>>>>>>>>> There is some sense that if we can get the G1 GC that we could get
>>>>>>>>>> rid of all long pauses, and avoid the need for this.
>>>>>>>>>>
>>>>>>>>>> -ryan
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 26, 2009 at 2:30 PM, Zhenyu Zhong <
>>>>>>>>>> zhongresearch@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I am very interested in the solution that Joey proposed and would
>>>>>>>>>>> like to give it a try.
>>>>>>>>>>> Does anyone have any ideas on how to deploy this zk_wrapper in JNI
>>>>>>>>>>> integration?
>>>>>>>>>>>
>>>>>>>>>>> I would really appreciate it.
>>>>>>>>>>>
>>>>>>>>>>> thanks
>>>>>>>>>>> zhong
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>
