phoenix-dev mailing list archives

From Chris Tarnas <...@biotiquesystems.com>
Subject Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1
Date Tue, 20 May 2014 22:03:13 GMT
Hi Jeffrey,

Thank you for taking a look. I added log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
to the custom log4j.properties in Ambari and did a rolling restart of the regionservers (and
verified it was pushed out to the nodes in /etc/hbase/conf/log4j.properties), then issued a "select
count(*)" on the table. Looking at the logs I don't see any TRACE messages. The only two log
messages that appeared when issuing the select are:

2014-05-20 16:43:53,609 INFO  [RpcServer.handler=9,port=60020] coprocessor.UngroupedAggregateRegionObserver:
Starting ungrouped coprocessor scan {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
2014-05-20 16:43:54,309 INFO  [RpcServer.handler=9,port=60020] coprocessor.UngroupedAggregateRegionObserver:
Finished scanning 18608 rows for ungrouped coprocessor scan {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
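
One caveat: I'm not sure the old HBaseServer.trace logger name still applies to the rewritten RPC code in the HBase that ships with HDP 2.1, which could explain the silence at TRACE. Since the handler threads in the dumps below are named RpcServer.*, my next guess (the exact logger names here are an assumption on my part, not verified against the source) is to also add:

	log4j.logger.org.apache.hadoop.ipc.RpcServer=TRACE
	log4j.logger.org.apache.hadoop.ipc.RpcServer.trace=TRACE

to the same custom log4j.properties and do another rolling restart.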

The metrics for the regionserver show only 2 total read requests since it started and 0 per
second now. Those two requests are for the table I did the count on. The regionserver's UI
says it has no active RPC calls.  It currently has one stuck handler from running that command:

"RpcServer.handler=8,port=60020" daemon prio=10 tid=0x0000000001c08000 nid=0x2582 runnable
[0x00007fbd5ba6c000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2888)
	at org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2880)
	at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:369)
	at java.util.TreeMap.getEntry(TreeMap.java:340)
	at java.util.TreeMap.get(TreeMap.java:273)
	at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.isDeleted(GetClosestRowBeforeTracker.java:128)
	at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.addCandidate(GetClosestRowBeforeTracker.java:107)
	at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.handle(GetClosestRowBeforeTracker.java:202)
	at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1772)
	at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
	at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
	at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
	at java.lang.Thread.run(Thread.java:744)
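
For what it's worth, the way I'm confirming that this particular handler is the one burning CPU is the usual top/jstack trick; the pgrep pattern is just how I find the regionserver pid on our boxes:

	top -H -p $(pgrep -f HRegionServer)    # note the TID of the hottest thread
	printf '%x\n' <tid>                    # the hex value should match the nid above, e.g. 0x2582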

It also has one listener running:

"RpcServer.listener,port=60020" daemon prio=10 tid=0x0000000001db8000 nid=0x2579 runnable
[0x00007fbd5c376000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
	- locked <0x000000059ae28988> (a sun.nio.ch.Util$2)
	- locked <0x000000059ae289a0> (a java.util.Collections$UnmodifiableSet)
	- locked <0x000000059ae294b8> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
	at org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:676)
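
In case it helps with reproducing, this is roughly how I've been pulling the thread dumps and the request counters mentioned above; the /jmx path on the regionserver info port is my assumption about where the read/write request metrics are exposed:

	jstack <regionserver-pid> | grep -A 20 'RpcServer.handler'
	curl -s http://<regionserver-host>:60030/jmx | grep -i request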

If there is anything else you need, please let me know.

thanks again,
-chris


On May 20, 2014, at 2:03 PM, Jeffrey Zhong <jzhong@hortonworks.com> wrote:

> 
> I rechecked your RPC log and found that you didn't turn on the RPC trace
> level. You can set the following so that we can see detailed requests and
> responses.
> 
> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
> 
> If you didn't see a timeout exception in your region server logs, it means
> that something keeps issuing lots of Get requests. Have you used a
> secondary index for this table? If that's the case, please disable/drop it
> to see if that helps.
> 
> 
> 
> 
> On 5/20/14 1:21 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:
> 
>> I will see how feasible a new, all bare metal cluster is, but we have 5
>> other similarly configured clusters with no troubles. We only see an issue
>> when Phoenix does a full scan of some tables: a handler gets stuck in
>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>> after it has returned all of its data.
>> 
>> thank you,
>> 
>> -chris
>> 
>> On May 20, 2014, at 12:34 PM, alex kamil <alex.kamil@gmail.com> wrote:
>> 
>>> i'm just guessing, but maybe the region servers are stuck because they
>>> can't communicate with zookeeper or hmaster, or maybe the datanodes can't
>>> talk to the namenode because of some vmware networking quirk and some
>>> internal thread is hanging, who knows.
>>> it would take 30 min to set up a bare metal cluster vs days debugging this
>>> hybrid setup.
>>> I'd also suggest installing directly from the hbase/hadoop websites to
>>> eliminate additional variables
>>> 
>>> 
>>> 
>>> 
>>> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas <cft@biotiquesystems.com> wrote:
>>> 
>>>> Only the master nodes (HMaster, ZK and NN) are VMs. The
>>>> datanodes/regionservers with the stuck processes are all bare metal.
>>>> 
>>>> -chris
>>>> 
>>>> On May 20, 2014, at 11:52 AM, alex kamil <alex.kamil@gmail.com> wrote:
>>>> 
>>>>>> "with 3 VMWare "master" nodes"
>>>>> 
>>>>> can you try running hbase/phoenix on physical nodes instead of using
>>>>> virtual machines
>>>>> 
>>>>> 
>>>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <cft@biotiquesystems.com> wrote:
>>>>> 
>>>>>> Sorry to follow up on my own message, but I was wondering if anyone
>>>>>> had any ideas? Normal non-phoenix scans don't cause this symptom, but
>>>>>> a select * on the exact same table right after will.
>>>>>> 
>>>>>> If we export the table and then re-import it into a new table, the new
>>>>>> table doesn't exhibit these symptoms, same as if we use an
>>>>>> upsert..select to do a copy. It seems something happens to the last
>>>>>> region to cause this, but it is not directly data dependent. Moving the
>>>>>> region to another regionserver doesn't have any effect - just moves
>>>>>> where the problem happens. Major compactions get hung up by the running
>>>>>> threads as they probably have a lock.
>>>>>> 
>>>>>> I've run the hfile tool on the final region and nothing seems awry.
>>>>>> 
>>>>>> Figuring this out will allow this project to continue; as of now it is
>>>>>> hung up on this issue.
>>>>>> 
>>>>>> thank you,
>>>>>> -chris
>>>>>> 
>>>>>> 
>>>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas <cft@biotiquesystems.com> wrote:
>>>>>> 
>>>>>>> I did some "poor man's" profiling with multiple jstacks and came up
>>>>>>> with where the RpcServer.handler appears to be stuck:
>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile.
>>>>>>> That is the deepest method in all of the traces, either
>>>>>>> HStore.java:1712 or HStore.java:1722. Here are two example traces for
>>>>>>> a thread (which has been running for the last couple of hours):
>>>>>>> 
>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>>>     at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
>>>>>>>     at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>>>>>     at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
>>>>>>>     at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
>>>>>>>     at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
>>>>>>>     at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>>>     at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>>>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>>>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>>>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>>>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>>>     at java.lang.Thread.run(Thread.java:744)
>>>>>>> 
>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>>>     at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
>>>>>>>     at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>>>>>     at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
>>>>>>>     at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
>>>>>>>     at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
>>>>>>>     at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
>>>>>>>     at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
>>>>>>>     at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>>>     at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>>>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>>>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>>>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>>>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>>>     at java.lang.Thread.run(Thread.java:744)
>>>>>>> 
>>>>>>> 
>>>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzhong@hortonworks.com> wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> Hey Chris,
>>>>>>>> 
>>>>>>>> I used the performance.py tool, which created one table with 50K
>>>>>>>> rows, ran the following query from sqlline.py, and everything seems
>>>>>>>> fine without the CPU running hot.
>>>>>>>> 
>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>>>>>> +------------+
>>>>>>>> |  COUNT(1)  |
>>>>>>>> +------------+
>>>>>>>> | 50000      |
>>>>>>>> +------------+
>>>>>>>> 1 row selected (0.166 seconds)
>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>>>>>> +------------+
>>>>>>>> |  COUNT(1)  |
>>>>>>>> +------------+
>>>>>>>> | 50000      |
>>>>>>>> +------------+
>>>>>>>> 1 row selected (0.167 seconds)
>>>>>>>> 
>>>>>>>> Is there any way you could run a profiler to see where the CPU goes?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:
>>>>>>>> 
>>>>>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>>>>> 
>>>>>>>>> http://pastebin.com/w6mtabag
>>>>>>>>> 
>>>>>>>>> thanks again,
>>>>>>>>> -chris
>>>>>>>>> 
>>>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Chris,
>>>>>>>>>> 
>>>>>>>>>> Attachments are filtered out by the mail server. Can you pastebin
>>>>>>>>>> it some place?
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Nick
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <cft@biotiquesystems.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hello,
>>>>>>>>>>> 
>>>>>>>>>>> We set the HBase RegionServer Handler count to 10 (it appears to
>>>>>>>>>>> have been set to 60 by Ambari during the install process). Now we
>>>>>>>>>>> have narrowed down what causes the CPU to increase and have some
>>>>>>>>>>> detailed logs:
>>>>>>>>>>> 
>>>>>>>>>>> If we connect using sqlline.py and execute a select that selects
>>>>>>>>>>> one row using the primary key, no increase in CPU is observed and
>>>>>>>>>>> the number of RPC threads in a RUNNABLE state remains the same.
>>>>>>>>>>> 
>>>>>>>>>>> If we execute a select that scans the table, such as "select
>>>>>>>>>>> count(*) from TABLE", or where the "where" clause only limits on
>>>>>>>>>>> non-primary key attributes, then the number of RUNNABLE
>>>>>>>>>>> RpcServer.handler threads increases and the CPU utilization of the
>>>>>>>>>>> regionserver increases by ~105%.
>>>>>>>>>>> 
>>>>>>>>>>> Disconnecting the client does not have an effect; the
>>>>>>>>>>> RpcServer.handler thread is left RUNNABLE and the CPU stays at the
>>>>>>>>>>> higher usage.
>>>>>>>>>>> 
>>>>>>>>>>> Checking the Web Console for the regionserver just shows 10
>>>>>>>>>>> RpcServer.reader tasks, all in a WAITING state; no other monitored
>>>>>>>>>>> tasks are happening. The regionserver has a Max Heap of 10G and a
>>>>>>>>>>> Used heap of 445.2M.
>>>>>>>>>>> 
>>>>>>>>>>> I've attached the regionserver log with IPC debug logging turned on
>>>>>>>>>>> right when one of the Phoenix statements is executed (this
>>>>>>>>>>> statement actually used up the last available handler).
>>>>>>>>>>> 
>>>>>>>>>>> thanks,
>>>>>>>>>>> -chris
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzhong@hortonworks.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> From the stack, it seems you increased the default rpc handler
>>>>>>>>>>>> number to about 60. All handlers are serving Get requests (you can
>>>>>>>>>>>> search for
>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
>>>>>>>>>>>> 
>>>>>>>>>>>> You can check why there are so many Get requests by adding some
>>>>>>>>>>>> log info or enabling HBase RPC trace. I guess if you decrease the
>>>>>>>>>>>> number of rpc handlers per region server, it will mitigate your
>>>>>>>>>>>> current issue.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> We have hit a problem with Phoenix where regionserver CPU usage
>>>>>>>>>>>>> spikes up to use all available CPU and the regionservers become
>>>>>>>>>>>>> unresponsive.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> After HDP 2.1 was released we set up a 4 compute node cluster
>>>>>>>>>>>>> (with 3 VMWare "master" nodes) to test out Phoenix on it. It is a
>>>>>>>>>>>>> plain Ambari 1.5/HDP 2.1 install and we added the HDP Phoenix RPM
>>>>>>>>>>>>> release and hand linked in the jar files to the hadoop lib.
>>>>>>>>>>>>> Everything was going well and we were able to load in ~30k
>>>>>>>>>>>>> records into several tables. What happened was after about 3-4
>>>>>>>>>>>>> days of being up the regionservers became unresponsive and
>>>>>>>>>>>>> started to use most of the available CPU (12 core boxes). Nothing
>>>>>>>>>>>>> terribly informative was in the logs (initially we saw some flush
>>>>>>>>>>>>> messages that seemed excessive, but that was not all of the time
>>>>>>>>>>>>> and we changed back to the standard HBase WAL codec). We are able
>>>>>>>>>>>>> to kill the unresponsive regionservers and then restart them; the
>>>>>>>>>>>>> cluster will be fine for a day or so but will start to lock up
>>>>>>>>>>>>> again.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> We've dropped the entire HBase and zookeeper information and
>>>>>>>>>>>>> started from scratch, but that has not helped.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> James Taylor suggested I send this off here. I've attached a
>>>>>>>>>>>>> jstack report of a locked up regionserver in hopes that someone
>>>>>>>>>>>>> can shed some light.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>> -chris
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
> 
> 
> 

