phoenix-dev mailing list archives

From Chris Tarnas <...@biotiquesystems.com>
Subject Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1
Date Wed, 21 May 2014 17:46:50 GMT
We will try out the two work-arounds today. I've added the TRACE logging; here is the log from
an affected regionserver while running a problem query/table combination, and below it is the
new stuck thread's stack trace:

-------------------------------------

2014-05-21 12:44:39,280 DEBUG [RpcServer.listener,port=60020] ipc.RpcServer: RpcServer.listener,port=60020:
connection from 10.56.200.108:39956; # active connections: 1
2014-05-21 12:44:39,280 DEBUG [RpcServer.reader=7,port=60020] ipc.RpcServer: Authorized user_info
{ effective_user: "tarnasch" } service_name: "ClientService" cell_block_codec_class: "org.apache.hadoop.hbase.codec.KeyValueCodec"
2014-05-21 12:44:39,281 TRACE [RpcServer.reader=7,port=60020] ipc.RpcServer: RequestHeader
call_id: 22 method_name: "Scan" request_param: true totalRequestSize: 229 bytes
2014-05-21 12:44:39,282 DEBUG [RpcServer.handler=6,port=60020] ipc.RpcServer: callId: 22 service:
ClientService methodName: Scan size: 229 connection: 10.56.200.108:39956 executing as tarnasch
2014-05-21 12:44:39,284 TRACE [RpcServer.reader=7,port=60020] ipc.RpcServer: RequestHeader
call_id: 28 method_name: "Get" request_param: true totalRequestSize: 94 bytes
2014-05-21 12:44:39,284 INFO  [RpcServer.handler=6,port=60020] coprocessor.UngroupedAggregateRegionObserver:
Starting ungrouped coprocessor scan {"timeRange":[0,1400694279223],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
2014-05-21 12:44:39,285 DEBUG [RpcServer.handler=7,port=60020] ipc.RpcServer: callId: 28 service:
ClientService methodName: Get size: 94 connection: 10.56.200.108:39956 executing as tarnasch
2014-05-21 12:44:39,706 INFO  [RpcServer.handler=6,port=60020] coprocessor.UngroupedAggregateRegionObserver:
Finished scanning 18608 rows for ungrouped coprocessor scan {"timeRange":[0,1400694279223],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
2014-05-21 12:44:39,707 TRACE [RpcServer.handler=6,port=60020] ipc.RpcServer: callId: 22 service:
ClientService methodName: Scan size: 229 connection: 10.56.200.108:39956 param: region { type:
REGION_NAME value: "ACRE.GE,\003\000\000\000\000,1399489118008.78e15ac392815f1a8f2b5a53b80a0843."
} scan { column { family: "0" } attribute { name: "_ondemand_" value: "\001" } attribute {
name: "_Aggs" value: "\001\001\030\001\003\005\200\000\000\001\002\003" } attribute { name:
"_UngroupedAgg" value: "\001" } start_row: "\003\000\000\000\000" filter { name: "org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter"
serialized_filter: "" } time_range { from: 0 to: 1400694279223 } max_versions: 1 cache_blocks:
true } number_of_rows: 0 close_scanner: false connection: 10.56.200.108:39956, response scanner_id:
5393104135575718994 more_results: true ttl: 60000 queueTime: 0 processingTime: 424
2014-05-21 12:44:39,707 DEBUG [RpcServer.handler=6,port=60020] ipc.RpcServer: RpcServer.responder:
callId: 22 wrote 24 bytes.
2014-05-21 12:44:39,707 TRACE [RpcServer.reader=7,port=60020] ipc.RpcServer: RequestHeader
call_id: 59 method_name: "Scan" request_param: true totalRequestSize: 28 bytes
2014-05-21 12:44:39,708 DEBUG [RpcServer.handler=5,port=60020] ipc.RpcServer: callId: 59 service:
ClientService methodName: Scan size: 28 connection: 10.56.200.108:39956 executing as tarnasch
2014-05-21 12:44:39,708 TRACE [RpcServer.handler=5,port=60020] ipc.RpcServer: callId: 59 service:
ClientService methodName: Scan size: 28 connection: 10.56.200.108:39956 param: scanner_id:
5393104135575718994 number_of_rows: 100 close_scanner: false next_call_seq: 0 connection:
10.56.200.108:39956, response cells_per_result: 1 scanner_id: 5393104135575718994 more_results:
true ttl: 60000 queueTime: 0 processingTime: 0
2014-05-21 12:44:39,709 DEBUG [RpcServer.handler=5,port=60020] ipc.RpcServer: RpcServer.responder:
callId: 59 wrote 65 bytes.
2014-05-21 12:44:39,710 TRACE [RpcServer.reader=7,port=60020] ipc.RpcServer: RequestHeader
call_id: 60 method_name: "Scan" request_param: true totalRequestSize: 26 bytes
2014-05-21 12:44:39,710 DEBUG [RpcServer.handler=2,port=60020] ipc.RpcServer: callId: 60 service:
ClientService methodName: Scan size: 26 connection: 10.56.200.108:39956 executing as tarnasch
2014-05-21 12:44:39,710 TRACE [RpcServer.handler=2,port=60020] ipc.RpcServer: callId: 60 service:
ClientService methodName: Scan size: 26 connection: 10.56.200.108:39956 param: scanner_id:
5393104135575718994 number_of_rows: 0 close_scanner: true connection: 10.56.200.108:39956,
response scanner_id: 5393104135575718994 more_results: false queueTime: 0 processingTime:
0
2014-05-21 12:44:39,710 DEBUG [RpcServer.handler=2,port=60020] ipc.RpcServer: RpcServer.responder:
callId: 60 wrote 20 bytes.

-------------------------------------

The corresponding new stuck-thread stack trace is:

"RpcServer.handler=7,port=60020" daemon prio=10 tid=0x00007ff7b06fc800 nid=0x7864 runnable
[0x00007ff78e028000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
	at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
	at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
	at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
	at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
	at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
	at java.lang.Thread.run(Thread.java:744)


thanks

On May 20, 2014, at 5:50 PM, Jeffrey Zhong <jzhong@hortonworks.com> wrote:

> 
> 
> Hey Chris,
> 
> Sorry about the trace option. The right option is below. Where you previously
> turned on DEBUG, you just need to change it to TRACE.
> 
>    log4j.logger.org.apache.hadoop.ipc.RpcServer=TRACE
> 
> 
> 
> Do you see that the two read requests burn one of your CPUs to 100%? From
> the call stacks you posted today and before, it seems there might be an
> infinite loop inside the function HTable.getRowOrBefore(), which is
> deprecated and supposed to work only for the meta region (brought up by
> Enis). Fortunately there is only one place in Phoenix that calls this
> function, so could you try the following things?
> 
> Set the following configuration in the client-side hbase-site.xml to disable
> stats updates:
>   phoenix.query.statsUpdateFrequency = 2147483647
>   phoenix.query.maxStatsAge = 0
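> 
> (In hbase-site.xml property form that would look roughly like the following;
> this is just a sketch using the standard property syntax, so double-check the
> names against your install:
> 
>   <property>
>     <name>phoenix.query.statsUpdateFrequency</name>
>     <value>2147483647</value>
>   </property>
>   <property>
>     <name>phoenix.query.maxStatsAge</name>
>     <value>0</value>
>   </property>
> )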
> 
> After the configuration changes, could you verify if the CPU is still
> spiking for the same query?
> 
> You can also do the following:
> 
> Alter your table's column family data block encoding to NONE
> (DATA_BLOCK_ENCODING => 'NONE') and run a major compaction after that through
> the HBase client.
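> 
> (From the HBase shell that would be roughly the following; I'm taking the
> table name ACRE.GE and column family "0" from your logs, so adjust them if
> they differ:
> 
>   alter 'ACRE.GE', {NAME => '0', DATA_BLOCK_ENCODING => 'NONE'}
>   major_compact 'ACRE.GE'
> )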
> 
> 
> We suspect that the combination of your data, FAST_DIFF data block
> encoding, and the deprecated HTable.getRowOrBefore() causes the issue.
> 
> 
> 
> On 5/20/14 3:03 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:
> 
>> Hi Jeffrey,
>> 
>> Thank you for taking a look. I added
>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE to the Custom
>> log4j.properties in Ambari and did a rolling restart of the regionservers
>> (and verified it was pushed out to the nodes in
>> /etc/hbase/conf/log4j.properties), then issued a "select count(*)" on the
>> table. Looking at the logs I don't see any TRACE messages. The only two
>> log messages that appeared when issuing the select are:
>> 
>> 2014-05-20 16:43:53,609 INFO  [RpcServer.handler=9,port=60020] coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped coprocessor scan {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
>> 2014-05-20 16:43:54,309 INFO  [RpcServer.handler=9,port=60020] coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608 rows for ungrouped coprocessor scan {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
>> 
>> The metrics for the regionserver show only 2 total read requests since it
>> started and 0 per second now. Those two requests are for the table I did
>> the count on. The regionserver's UI says it has no active RPC calls.  It
>> currently has one stuck handler from running that command:
>> 
>> "RpcServer.handler=8,port=60020" daemon prio=10 tid=0x0000000001c08000
>> nid=0x2582 runnable [0x00007fbd5ba6c000]
>>  java.lang.Thread.State: RUNNABLE
>> 	at 
>> org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2
>> 888)
>> 	at 
>> org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java:2
>> 880)
>> 	at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:369)
>> 	at java.util.TreeMap.getEntry(TreeMap.java:340)
>> 	at java.util.TreeMap.get(TreeMap.java:273)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.isDeleted(
>> GetClosestRowBeforeTracker.java:128)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.addCandida
>> te(GetClosestRowBeforeTracker.java:107)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.handle(Get
>> ClosestRowBeforeTracker.java:202)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.
>> java:1772)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HSt
>> ore.java:1722)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.jav
>> a:1655)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.j
>> ava:1826)
>> 	at 
>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:
>> 2841)
>> 	at 
>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.ca
>> llBlockingMethod(ClientProtos.java:28857)
>> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>> 	at 
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcSched
>> uler.java:160)
>> 	at 
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSchedul
>> er.java:38)
>> 	at 
>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.ja
>> va:110)
>> 	at java.lang.Thread.run(Thread.java:744)
>> 
>> It also has one listener running:
>> 
>> "RpcServer.listener,port=60020" daemon prio=10 tid=0x0000000001db8000
>> nid=0x2579 runnable [0x00007fbd5c376000]
>>  java.lang.Thread.State: RUNNABLE
>> 	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>> 	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>> 	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>> 	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>> 	- locked <0x000000059ae28988> (a sun.nio.ch.Util$2)
>> 	- locked <0x000000059ae289a0> (a java.util.Collections$UnmodifiableSet)
>> 	- locked <0x000000059ae294b8> (a sun.nio.ch.EPollSelectorImpl)
>> 	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>> 	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>> 	at org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:676)
>> 
>> If there is anything else you need, please let me know.
>> 
>> thanks again,
>> -chris
>> 
>> 
>> On May 20, 2014, at 2:03 PM, Jeffrey Zhong <jzhong@hortonworks.com> wrote:
>> 
>>> 
>>> I rechecked your RPC log and found that you didn't turn on the RPC trace
>>> level. You can set the following so that we can see the detailed requests
>>> and responses:
>>> 
>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
>>> 
>>> If you didn't see a timeout exception in your region server logs, it means
>>> that something keeps issuing lots of Get requests. Have you used a
>>> secondary index for this table? If so, please disable/drop it to see if
>>> that helps.
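>>> 
>>> (Roughly, from sqlline, with MY_IDX and MY_TABLE as placeholders for your
>>> actual index and table names:
>>> 
>>>   ALTER INDEX MY_IDX ON MY_TABLE DISABLE;
>>>   -- or, to remove it completely:
>>>   DROP INDEX MY_IDX ON MY_TABLE;
>>> )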
>>> 
>>> 
>>> 
>>> 
>>> On 5/20/14 1:21 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:
>>> 
>>>> I will see how feasible a new, all-bare-metal cluster is, but we have 5
>>>> other similarly configured clusters with no trouble. We only have an
>>>> issue when Phoenix does a full scan of certain tables: a handler gets
>>>> stuck in
>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>>>> after it has returned all of its data.
>>>> 
>>>> thank you,
>>>> 
>>>> -chris
>>>> 
>>>> On May 20, 2014, at 12:34 PM, alex kamil <alex.kamil@gmail.com> wrote:
>>>> 
>>>>> I'm just guessing, but maybe the region servers are stuck because they
>>>>> can't communicate with ZooKeeper or the HMaster, or maybe the datanodes
>>>>> can't talk to the namenode because of some VMware networking quirk and
>>>>> some internal thread is hanging; who knows. It would take 30 minutes to
>>>>> set up a bare metal cluster vs. days debugging this hybrid setup. I'd
>>>>> also suggest installing directly from the hbase/hadoop websites to
>>>>> eliminate additional variables.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas <cft@biotiquesystems.com> wrote:
>>>>> 
>>>>>> Only the master nodes (HMaster, ZK and NN) are VMs. The
>>>>>> datanodes/regionservers with the stuck processes are all bare metal.
>>>>>> 
>>>>>> -chris
>>>>>> 
>>>>>> On May 20, 2014, at 11:52 AM, alex kamil <alex.kamil@gmail.com> wrote:
>>>>>> 
>>>>>>>> "with 3 VMWare "master" nodes"
>>>>>>> 
>>>>>>> can you try running hbase/phoenix on physical nodes instead of using
>>>>>>> virtual machines
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <cft@biotiquesystems.com> wrote:
>>>>>>> 
>>>>>>>> Sorry to follow up on my own message, but I was wondering if anyone had
>>>>>>>> any ideas? Normal non-Phoenix scans don't cause this symptom, but a
>>>>>>>> select * on the exact same table right afterwards will.
>>>>>>>> 
>>>>>>>> If we export the table and then re-import it into a new table, the new
>>>>>>>> table doesn't exhibit these symptoms, same as if we use an
>>>>>>>> upsert..select to do a copy. It seems something happens to the last
>>>>>>>> region to cause this, but it is not directly data dependent. Moving the
>>>>>>>> region to another regionserver doesn't have any effect - it just moves
>>>>>>>> where the problem happens. Major compactions get hung up by the running
>>>>>>>> threads, as they probably hold a lock.
>>>>>>>> 
>>>>>>>> I've run the hfile tool on the final region and nothing seems awry.
>>>>>>>> 
>>>>>>>> Figuring this out will allow this project to continue; as of now it is
>>>>>>>> hung up on this issue.
>>>>>>>> 
>>>>>>>> thank you,
>>>>>>>> -chris
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas <cft@biotiquesystems.com> wrote:
>>>>>>>> 
>>>>>>>>> I did some "poor mans" profiling with multiple jstack and came up with
>>>>>>>>> where the RpcServer.handler appears to be stuck:
>>>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile.
>>>>>>>>> That is the last deepest method in all of the traces, either
>>>>>>>>> HStore.java:1712 or HStore.java:1722. Here are two example traces for a
>>>>>>>>> thread (which has been running for the last couple hours):
>>>>>>>>> 
>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
>>>>>>>>> tid=0x0000000000cdb800
>>>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(
>>>>>> Fa
>>>>>> stDiffDeltaEncoder.java:540)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedE
>>>>>> nc
>>>>>> odedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(H
>>>>>> Fi
>>>>>> leReaderV2.java:1063)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HSt
>>>>>> or
>>>>>> e.java:1776)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>>>>>> (H
>>>>>> Store.java:1722)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore
>>>>>> .j
>>>>>> ava:1655)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegi
>>>>>> on
>>>>>> .java:1826)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.j
>>>>>> av
>>>>>> a:2841)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$
>>>>>> 2.
>>>>>> callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>>   at 
>>>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>>   at 
>>>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcS
>>>>>> ch
>>>>>> eduler.java:160)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSch
>>>>>> ed
>>>>>> uler.java:38)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcSchedule
>>>>>> r.
>>>>>> java:110)
>>>>>>>>>   at java.lang.Thread.run(Thread.java:744)
>>>>>>>>> 
>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
>>>>>>>>> tid=0x0000000000cdb800
>>>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>>>> java.lang.Thread.State: RUNNABLE
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:19
>>>>>> 44
>>>>>> )
>>>>>>>>>   at
>>>>>> org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.root
>>>>>> Bl
>>>>>> ockContainingKey(HFileBlockIndex.java:392)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.load
>>>>>> Da
>>>>>> taBlockWithScanInfo(HFileBlockIndex.java:209)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seek
>>>>>> To
>>>>>> DataBlock(HFileBlockIndex.java:179)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekB
>>>>>> ef
>>>>>> ore(HFileReaderV2.java:548)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>>>>>> (H
>>>>>> Store.java:1712)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore
>>>>>> .j
>>>>>> ava:1655)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegi
>>>>>> on
>>>>>> .java:1826)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.j
>>>>>> av
>>>>>> a:2841)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$
>>>>>> 2.
>>>>>> callBlockingMethod(ClientProtos.java:28857)
>>>>>>>>>   at 
>>>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>>>>   at 
>>>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcS
>>>>>> ch
>>>>>> eduler.java:160)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSch
>>>>>> ed
>>>>>> uler.java:38)
>>>>>>>>>   at
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcSchedule
>>>>>> r.
>>>>>> java:110)
>>>>>>>>>   at java.lang.Thread.run(Thread.java:744)
>>>>>>>>> 
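>>>>>>>>> (For reference, the "poor mans" profiling above was just repeated jstack
>>>>>>>>> dumps along these lines, with the regionserver pid as a placeholder and
>>>>>>>>> the sleep interval chosen arbitrarily:
>>>>>>>>> 
>>>>>>>>>   for i in $(seq 1 10); do jstack <regionserver-pid> > rs-jstack.$i; sleep 30; done
>>>>>>>>> )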
>>>>>>>>> 
>>>>>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzhong@hortonworks.com> wrote:
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Hey Chris,
>>>>>>>>>> 
>>>>>>>>>> I used the performance.py tool, which created a table with 50K rows,
>>>>>>>>>> ran the following query from sqlline.py, and everything seems fine
>>>>>>>>>> without the CPU running hot.
>>>>>>>>>> 
>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>>>>>>>> +------------+
>>>>>>>>>> |  COUNT(1)  |
>>>>>>>>>> +------------+
>>>>>>>>>> | 50000      |
>>>>>>>>>> +------------+
>>>>>>>>>> 1 row selected (0.166 seconds)
>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>>>>>>>> +------------+
>>>>>>>>>> |  COUNT(1)  |
>>>>>>>>>> +------------+
>>>>>>>>>> | 50000      |
>>>>>>>>>> +------------+
>>>>>>>>>> 1 row selected (0.167 seconds)
>>>>>>>>>> 
>>>>>>>>>> Is there any way you could run a profiler to see where the CPU goes?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>>>>>>> 
>>>>>>>>>>> http://pastebin.com/w6mtabag
>>>>>>>>>>> 
>>>>>>>>>>> thanks again,
>>>>>>>>>>> -chris
>>>>>>>>>>> 
>>>>>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Chris,
>>>>>>>>>>>> 
>>>>>>>>>>>> Attachments are filtered out by the mail server. Can you pastebin it
>>>>>>>>>>>> some place?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Nick
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <cft@biotiquesystems.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> We set the HBase RegionServer Handler count to 10 (it appears to
>>>>>>>>>>>>> have been set to 60 by Ambari during the install process). Now we
>>>>>>>>>>>>> have narrowed down what causes the CPU to increase and have some
>>>>>>>>>>>>> detailed logs:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If we connect using sqlline.py and execute a select that selects
>>>>>>>>>>>>> one row using the primary key, no increase in CPU is observed and
>>>>>>>>>>>>> the number of RPC threads in a RUNNABLE state remains the same.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If we execute a select that scans the table, such as "select
>>>>>>>>>>>>> count(*) from TABLE", or where the "where" clause only limits on
>>>>>>>>>>>>> non-primary-key attributes, then the number of RUNNABLE
>>>>>>>>>>>>> RpcServer.handler threads increases and the CPU utilization of the
>>>>>>>>>>>>> regionserver increases by ~105%.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Disconnecting the client does not have an effect; the
>>>>>>>>>>>>> RpcServer.handler thread is left RUNNABLE and the CPU stays at the
>>>>>>>>>>>>> higher usage.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Checking the Web Console for the Regionserver just shows 10
>>>>>>>>>>>>> RpcServer.reader tasks, all in a WAITING state; no other monitored
>>>>>>>>>>>>> tasks are happening. The regionserver has a Max Heap of 10G and a
>>>>>>>>>>>>> Used Heap of 445.2M.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I've attached the regionserver log with IPC debug logging turned on
>>>>>>>>>>>>> right when one of the Phoenix statements was executed (this
>>>>>>>>>>>>> statement actually used up the last available handler).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>> -chris
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzhong@hortonworks.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> From the stack, it seems you increased the default rpc handler
>>>>>>>>>>>>>> number to about 60. All handlers are serving Get requests (you can
>>>>>>>>>>>>>> search for
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> You can check why there are so many Get requests by adding some
>>>>>>>>>>>>>> log info or enabling HBase RPC trace. I guess if you decrease the
>>>>>>>>>>>>>> number of rpc handlers per region server, it will mitigate your
>>>>>>>>>>>>>> current issue.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We have hit a problem with Phoenix and regionserver CPU usage
>>>>>>>>>>>>>>> spiking up to use all available CPU and becoming unresponsive.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> After HDP 2.1 was released we set up a 4 compute node cluster
>>>>>>>>>>>>>>> (with 3 VMWare "master" nodes) to test out Phoenix on it. It is a
>>>>>>>>>>>>>>> plain Ambari 1.5/HDP 2.1 install; we added the HDP Phoenix RPM
>>>>>>>>>>>>>>> release and hand-linked the jar files into the hadoop lib.
>>>>>>>>>>>>>>> Everything was going well and we were able to load ~30k records
>>>>>>>>>>>>>>> into several tables. What happened was that after about 3-4 days
>>>>>>>>>>>>>>> of being up, the regionservers became unresponsive and started to
>>>>>>>>>>>>>>> use most of the available CPU (12 core boxes). Nothing terribly
>>>>>>>>>>>>>>> informative was in the logs (initially we saw some flush messages
>>>>>>>>>>>>>>> that seemed excessive, but that was not all of the time, and we
>>>>>>>>>>>>>>> changed back to the standard HBase WAL codec). We are able to
>>>>>>>>>>>>>>> kill the unresponsive regionservers and then restart them; the
>>>>>>>>>>>>>>> cluster will be fine for a day or so but will start to lock up
>>>>>>>>>>>>>>> again.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We've dropped the entire HBase and ZooKeeper information and
>>>>>>>>>>>>>>> started from scratch, but that has not helped.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> James Taylor suggested I send this off here. I've attached a
>>>>>>>>>>>>>>> jstack report of a locked-up regionserver in hopes that someone
>>>>>>>>>>>>>>> can shed some light.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>> -chris
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
> 
> 
> 

