phoenix-dev mailing list archives

From Jeffrey Zhong <jzh...@hortonworks.com>
Subject Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1
Date Wed, 21 May 2014 00:50:55 GMT


Hey Chris,

Sorry about the trace option; the correct one is below. Where you
previously turned on DEBUG, you just need to change it to TRACE:

    log4j.logger.org.apache.hadoop.ipc.RpcServer=TRACE



Do you see the two read requests burning one of your CPUs to 100%? From
the call stacks you posted today and earlier, it seems there might be an
infinite loop inside HTable.getRowOrBefore(), which is deprecated and
only supposed to work for the meta region (as Enis pointed out).
Fortunately there is only one place in Phoenix that calls this function,
so could you try the following things?

Set the following configuration in the client-side hbase-site.xml to
disable stats updates:
   phoenix.query.statsUpdateFrequency = 2147483647
   phoenix.query.maxStatsAge = 0
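
In hbase-site.xml form, a sketch of the same two properties (2147483647
is Integer.MAX_VALUE, so the stats cache is effectively never refreshed):

```
<!-- client-side hbase-site.xml: effectively disable Phoenix stats updates -->
<property>
  <name>phoenix.query.statsUpdateFrequency</name>
  <value>2147483647</value>
</property>
<property>
  <name>phoenix.query.maxStatsAge</name>
  <value>0</value>
</property>
```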

After the configuration changes, could you verify whether the CPU still
spikes for the same query?
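
If it is, one quick way to tie the hot thread back to a jstack trace (a
sketch; the TID value below is illustrative): take the per-thread CPU view
with `top -H -p <regionserver pid>`, note the hottest thread's decimal
TID, and convert it to the hex `nid=0x...` form that jstack prints:

```shell
# Convert a decimal thread id from `top -H` to the hex "nid" used in
# jstack output. 9602 is illustrative; it happens to correspond to the
# stuck handler's nid=0x2582 in the trace you posted.
tid=9602
printf 'nid=0x%x\n' "$tid"   # → nid=0x2582
```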

You can also do the following:

Alter your table's column family data block encoding to NONE
(DATA_BLOCK_ENCODING => 'NONE') and run a major compaction after that
through the HBase client.
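
In the HBase shell, that would look roughly like the sketch below;
MY_TABLE is a placeholder, and '0' is the column family name that appears
in your coprocessor scan logs:

```
hbase> alter 'MY_TABLE', {NAME => '0', DATA_BLOCK_ENCODING => 'NONE'}
hbase> major_compact 'MY_TABLE'
```

Note that the encoding change only applies to newly written HFiles, which
is why the major compaction afterwards is needed.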


We suspect that the combination of your data, FAST_DIFF data block
encoding, and the deprecated HTable.getRowOrBefore() causes the issue.



On 5/20/14 3:03 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:

>Hi Jeffrey,
>
>Thank you for taking a look. I added
>log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE to the Custom
>log4j.properties in Ambari and did a rolling restart of the regionservers
>(and verified it was pushed out to the nodes in
>/etc/hbase/conf/log4j.properties), then issued a "select count(*)" on the
>table. Looking at the logs I don't see any TRACE messages. The only two
>log messages that appeared when issuing the select are:
>
>2014-05-20 16:43:53,609 INFO  [RpcServer.handler=9,port=60020] coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped coprocessor scan {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
>2014-05-20 16:43:54,309 INFO  [RpcServer.handler=9,port=60020] coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608 rows for ungrouped coprocessor scan {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
>
>The metrics for the regionserver show only 2 total read requests since it
>started and 0 per second now. Those two requests are for the table I did
>the count on. The regionserver's UI says it has no active RPC calls.  It
>currently has one stuck handler from running that command:
>
>"RpcServer.handler=8,port=60020" daemon prio=10 tid=0x0000000001c08000 nid=0x2582 runnable [0x00007fbd5ba6c000]
>   java.lang.Thread.State: RUNNABLE
>	at org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2888)
>	at org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2880)
>	at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:369)
>	at java.util.TreeMap.getEntry(TreeMap.java:340)
>	at java.util.TreeMap.get(TreeMap.java:273)
>	at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.isDeleted(GetClosestRowBeforeTracker.java:128)
>	at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.addCandidate(GetClosestRowBeforeTracker.java:107)
>	at org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.handle(GetClosestRowBeforeTracker.java:202)
>	at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1772)
>	at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
>	at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>	at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>	at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>	at java.lang.Thread.run(Thread.java:744)
>
>It also has one listener running:
>
>"RpcServer.listener,port=60020" daemon prio=10 tid=0x0000000001db8000
>nid=0x2579 runnable [0x00007fbd5c376000]
>   java.lang.Thread.State: RUNNABLE
>	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>	- locked <0x000000059ae28988> (a sun.nio.ch.Util$2)
>	- locked <0x000000059ae289a0> (a java.util.Collections$UnmodifiableSet)
>	- locked <0x000000059ae294b8> (a sun.nio.ch.EPollSelectorImpl)
>	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>	at org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:676)
>
>If there is anything else you need, please let me know.
>
>thanks again,
>-chris
>
>
>On May 20, 2014, at 2:03 PM, Jeffrey Zhong <jzhong@hortonworks.com> wrote:
>
>> 
>> I rechecked your RPC log and found that you didn't turn on the RPC trace
>> level. You can set the following so that we can see the detailed requests
>> and responses.
>> 
>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
>> 
>> If you didn't see a timeout exception in your region server logs, it
>> means that something keeps issuing lots of Get requests. Have you used a
>> secondary index for this table? If that's the case, please disable/drop
>> it to see if that helps.
>> 
>> 
>> 
>> 
>> On 5/20/14 1:21 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:
>> 
>>> I will see how possible a new, all bare metal cluster is, but we have 5
>>> other similarly configured clusters with no troubles. We only have an
>>> issue when Phoenix does a full scan of some tables: a handler gets stuck
>>> in org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>>> after it has returned all of its data.
>>> 
>>> thank you,
>>> 
>>> -chris
>>> 
>>> On May 20, 2014, at 12:34 PM, alex kamil <alex.kamil@gmail.com> wrote:
>>> 
>>>> I'm just guessing, but maybe the region servers are stuck because they
>>>> can't communicate with zookeeper or hmaster, or maybe the datanodes
>>>> can't talk to the namenode because of some vmware networking quirk and
>>>> some internal thread is hanging, who knows.
>>>> It would take 30 min to set up a bare metal cluster vs days debugging
>>>> this hybrid setup.
>>>> Also I'd suggest installing directly from the hbase/hadoop websites to
>>>> eliminate additional variables.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas
>>>> <cft@biotiquesystems.com> wrote:
>>>> 
>>>>> Only the master nodes (HMaster, ZK and NN) are VMs. The
>>>>> datanodes/regionservers with the stuck processes are all bare metal.
>>>>> 
>>>>> -chris
>>>>> 
>>>>> On May 20, 2014, at 11:52 AM, alex kamil <alex.kamil@gmail.com>
>>>>>wrote:
>>>>> 
>>>>>>> "with 3 VMWare "master" nodes"
>>>>>> 
>>>>>> can you try running hbase/phoenix on physical nodes instead of using
>>>>>> virtual machines
>>>>>> 
>>>>>> 
>>>>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas
>>>>>> <cft@biotiquesystems.com> wrote:
>>>>>> 
>>>>>>> Sorry to follow up on my own message, but I was wondering if anyone
>>>>>>> had any ideas? Normal non-phoenix scans don't cause this symptom, but
>>>>>>> right after a select * on the exact same table will.
>>>>>>> 
>>>>>>> If we export the table and then re-import it into a new table, the
>>>>>>> new table doesn't exhibit these symptoms, same as if we use an
>>>>>>> upsert..select to do a copy. It seems something happens to the last
>>>>>>> region to cause this, but it is not directly data dependent. Moving
>>>>>>> the region to another regionserver doesn't have any effect - just
>>>>>>> moves where the problem happens. Major compactions get hung up by the
>>>>>>> running threads as they probably have a lock.
>>>>>>> 
>>>>>>> I've run the hfile tool on the final region and nothing seems awry.
>>>>>>> 
>>>>>>> Figuring this out will allow this project to continue; as of now it
>>>>>>> is hung up on this issue.
>>>>>>> 
>>>>>>> thank you,
>>>>>>> -chris
>>>>>>> 
>>>>>>> 
>>>>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas <cft@biotiquesystems.com>
>>>>> wrote:
>>>>>>> 
>>>>>>>> I did some "poor mans" profiling with multiple jstack runs and came
>>>>>>>> up with where the RpcServer.handler appears to be stuck:
>>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile.
>>>>>>>> That is the deepest method in all of the traces, either
>>>>>>>> HStore.java:1712 or HStore.java:1722. Here are two example traces for
>>>>>>>> a thread (which has been running for the last couple of hours):
>>>>>>>> 
>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>>>    at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
>>>>>>>>    at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>>>>>>    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>>>>    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>>>>    at java.lang.Thread.run(Thread.java:744)
>>>>>>>> 
>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>>>    at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
>>>>>>>>    at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>>>>>>    at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
>>>>>>>>    at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
>>>>>>>>    at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
>>>>>>>>    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>>>>    at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>>>>    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>>>>    at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>>>>    at java.lang.Thread.run(Thread.java:744)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzhong@hortonworks.com> wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Hey Chris,
>>>>>>>>> 
>>>>>>>>> I used the performance.py tool, which created one table with 50K
>>>>>>>>> rows, ran the following query from sqlline.py, and everything seems
>>>>>>>>> fine without seeing the CPU running hot.
>>>>>>>>> 
>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>>>>>>> +------------+
>>>>>>>>> |  COUNT(1)  |
>>>>>>>>> +------------+
>>>>>>>>> | 50000      |
>>>>>>>>> +------------+
>>>>>>>>> 1 row selected (0.166 seconds)
>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>>>>>>> +------------+
>>>>>>>>> |  COUNT(1)  |
>>>>>>>>> +------------+
>>>>>>>>> | 50000      |
>>>>>>>>> +------------+
>>>>>>>>> 1 row selected (0.167 seconds)
>>>>>>>>> 
>>>>>>>>> Is there any way you could run a profiler to see where the CPU goes?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>>>>>> 
>>>>>>>>>> http://pastebin.com/w6mtabag
>>>>>>>>>> 
>>>>>>>>>> thanks again,
>>>>>>>>>> -chris
>>>>>>>>>> 
>>>>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimiduk@gmail.com>
>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Chris,
>>>>>>>>>>> 
>>>>>>>>>>> Attachments are filtered out by the mail server. Can you pastebin
>>>>>>>>>>> it some place?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Nick
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
>>>>>>>>>>> <cft@biotiquesystems.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hello,
>>>>>>>>>>>> 
>>>>>>>>>>>> We set the HBase RegionServer Handler count to 10 (it appears to
>>>>>>>>>>>> have been set to 60 by Ambari during the install process). Now we
>>>>>>>>>>>> have narrowed down what causes the CPU to increase and have some
>>>>>>>>>>>> detailed logs:
>>>>>>>>>>>> 
>>>>>>>>>>>> If we connect using sqlline.py and execute a select that selects
>>>>>>>>>>>> one row using the primary key, no increase in CPU is observed and
>>>>>>>>>>>> the number of RPC threads in a RUNNABLE state remains the same.
>>>>>>>>>>>> 
>>>>>>>>>>>> If we execute a select that scans the table, such as "select
>>>>>>>>>>>> count(*) from TABLE" or one where the "where" clause only limits
>>>>>>>>>>>> on non-primary key attributes, then the number of RUNNABLE
>>>>>>>>>>>> RpcServer.handler threads increases and the CPU utilization of
>>>>>>>>>>>> the regionserver increases by ~105%.
>>>>>>>>>>>> 
>>>>>>>>>>>> Disconnecting the client does not have an effect; the
>>>>>>>>>>>> RpcServer.handler thread is left RUNNABLE and the CPU stays at
>>>>>>>>>>>> the higher usage.
>>>>>>>>>>>> 
>>>>>>>>>>>> Checking the Web Console for the Regionserver just shows 10
>>>>>>>>>>>> RpcServer.reader tasks, all in a WAITING state; no other
>>>>>>>>>>>> monitored tasks are happening. The regionserver has a Max Heap of
>>>>>>>>>>>> 10G and a Used Heap of 445.2M.
>>>>>>>>>>>> 
>>>>>>>>>>>> I've attached the regionserver log with IPC debug logging turned
>>>>>>>>>>>> on right when one of the Phoenix statements is executed (this
>>>>>>>>>>>> statement actually used up the last available handler).
>>>>>>>>>>>> 
>>>>>>>>>>>> thanks,
>>>>>>>>>>>> -chris
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzhong@hortonworks.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> From the stack, it seems you increased the default rpc handler
>>>>>>>>>>>>> number to about 60. All handlers are serving Get requests (you
>>>>>>>>>>>>> can search for
>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> You can check why there are so many Get requests by adding some
>>>>>>>>>>>>> log info or enabling hbase rpc trace. I guess if you decrease
>>>>>>>>>>>>> the number of rpc handlers per region server, it will mitigate
>>>>>>>>>>>>> your current issue.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We have hit a problem with Phoenix and regionserver CPU usage
>>>>>>>>>>>>>> spiking up to use all available CPU and becoming unresponsive.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> After HDP 2.1 was released we set up a 4 compute node cluster
>>>>>>>>>>>>>> (with 3 VMWare "master" nodes) to test out Phoenix on it. It is
>>>>>>>>>>>>>> a plain Ambari 1.5/HDP 2.1 install, and we added the HDP
>>>>>>>>>>>>>> Phoenix RPM release and hand linked the jar files into the
>>>>>>>>>>>>>> hadoop lib. Everything was going well and we were able to load
>>>>>>>>>>>>>> ~30k records into several tables. What happened was, after
>>>>>>>>>>>>>> about 3-4 days of being up, the regionservers became
>>>>>>>>>>>>>> unresponsive and started to use most of the available CPU (12
>>>>>>>>>>>>>> core boxes). Nothing terribly informative was in the logs
>>>>>>>>>>>>>> (initially we saw some flush messages that seemed excessive,
>>>>>>>>>>>>>> but that was not all of the time, and we changed back to the
>>>>>>>>>>>>>> standard HBase WAL codec). We are able to kill the unresponsive
>>>>>>>>>>>>>> regionservers and then restart them; the cluster will be fine
>>>>>>>>>>>>>> for a day or so but will start to lock up again.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We've dropped the entire HBase and zookeeper information and
>>>>>>>>>>>>>> started from scratch, but that has not helped.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> James Taylor suggested I send this off here. I've attached a
>>>>>>>>>>>>>> jstack report of a locked up regionserver in hopes that someone
>>>>>>>>>>>>>> can shed some light.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>> -chris
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>> 
>> 
>> 
>



-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.
