phoenix-dev mailing list archives

From Chris Tarnas <...@biotiquesystems.com>
Subject Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1
Date Tue, 20 May 2014 18:24:59 GMT
Sorry to follow up on my own message, but I was wondering if anyone had any ideas? Normal
non-Phoenix scans don't cause this symptom, but a Phoenix select * on the exact same table,
run right afterwards, will.

If we export the table and then re-import it into a new table, the new table doesn't exhibit
these symptoms; the same is true if we use an UPSERT ... SELECT to do the copy (sketched below).
It seems something happens to the last region to cause this, but it is not directly data
dependent. Moving the region to another regionserver doesn't have any effect; it just moves
where the problem happens. Major compactions get hung up behind the running threads, as they
probably hold a lock.
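
(The copy was done with a Phoenix UPSERT ... SELECT along these lines; the table names here
are just placeholders:

  UPSERT INTO NEW_TABLE SELECT * FROM OLD_TABLE;

and the table produced that way doesn't show the problem either.)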

I've run the hfile tool on the final region and nothing seems awry.
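That check was roughly the standard HFile pretty-printer invocation (the HDFS path below is a
placeholder, not the real region path):

  hbase org.apache.hadoop.hbase.io.hfile.HFile -v -m -f \
      hdfs:///apps/hbase/data/data/default/TABLE/<region>/<cf>/<hfile>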

Figuring this out will allow this project to continue; as of now it is blocked on this issue.

thank you,
-chris


On May 15, 2014, at 8:47 PM, Chris Tarnas <cft@biotiquesystems.com> wrote:

> I did some "poor man's" profiling with multiple jstack runs and came up with where the
> RpcServer.handler appears to be stuck:
> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile. That is the deepest
> frame in all of the traces, either HStore.java:1712 or HStore.java:1722. Here are two example
> traces for a thread (which has been running for the last couple of hours); a sketch of the
> sampling loop follows the traces:
> 
> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable
[0x00007f4b49e9e000]
>   java.lang.Thread.State: RUNNABLE
> 	at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
> 	at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
> 	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
> 	at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
> 	at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
> 	at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
> 	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> 	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> 	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> 	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> 	at java.lang.Thread.run(Thread.java:744)
> 
> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable
[0x00007f4b49e9e000]
>   java.lang.Thread.State: RUNNABLE
> 	at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
> 	at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
> 	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
> 	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
> 	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
> 	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
> 	at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
> 	at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
> 	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> 	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> 	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> 	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> 	at java.lang.Thread.run(Thread.java:744)
> 
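> The sampling itself was nothing fancy, roughly this loop (the pid lookup and sample count
> are placeholders):
> 
>   RS_PID=$(pgrep -f HRegionServer)          # regionserver JVM pid
>   for i in $(seq 1 10); do
>     jstack $RS_PID > rs-jstack-$i.txt       # one full thread dump per sample
>     sleep 5
>   done
>   # see what the busy handler threads are doing in every sample
>   grep -A 6 'RpcServer.handler' rs-jstack-*.txt
> 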
> 
> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzhong@hortonworks.com> wrote:
> 
>> 
>> Hey Chris,
>> 
>> I used the performance.py tool, which created a table with 50K rows, ran the following
>> query from sqlline.py, and everything seems fine, without the CPU running hot.
>> 
>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
>> PERFORMANCE_50000;
>> +------------+
>> |  COUNT(1)  |
>> +------------+
>> | 50000      |
>> +------------+
>> 1 row selected (0.166 seconds)
>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from
>> PERFORMANCE_50000;
>> +------------+
>> |  COUNT(1)  |
>> +------------+
>> | 50000      |
>> +------------+
>> 1 row selected (0.167 seconds)
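>> 
>> (For reference, that table came from something like
>> 
>>   bin/performance.py <zookeeper-quorum> 50000
>> 
>> which creates and populates the PERFORMANCE_50000 table.)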
>> 
>> Is there any way you could run a profiler to see where the CPU goes?
>> 
>> 
>> 
>> On 5/13/14 6:40 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:
>> 
>>> Ahh, yes. Here is a pastebin for it:
>>> 
>>> http://pastebin.com/w6mtabag
>>> 
>>> thanks again,
>>> -chris
>>> 
>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
>>> 
>>>> Hi Chris,
>>>> 
>>>> Attachments are filtered out by the mail server. Can you pastebin it someplace?
>>>> 
>>>> Thanks,
>>>> Nick
>>>> 
>>>> 
>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <cft@biotiquesystems.com> wrote:
>>>> 
>>>>> Hello,
>>>>> 
>>>>> We set the HBase RegionServer handler count to 10 (it appears to have
>>>>> been set to 60 by Ambari during the install process). Now we have
>>>>> narrowed down what causes the CPU to increase and have some detailed
>>>>> logs:
>>>>> 
>>>>> If we connect using sqlline.py and execute a select that fetches one row
>>>>> using the primary key, no increase in CPU is observed and the number of
>>>>> RPC threads in a RUNNABLE state remains the same.
>>>>> 
>>>>> If we execute a select that scans the table, such as "select count(*)
>>>>> from TABLE", or one whose "where" clause only limits on non-primary-key
>>>>> attributes, then the number of RUNNABLE RpcServer.handler threads
>>>>> increases and the CPU utilization of the regionserver increases by
>>>>> ~105%. (Both query shapes are sketched below.)
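>>>>> 
>>>>> To make the two shapes concrete (table and column names here are
>>>>> placeholders, not our actual schema):
>>>>> 
>>>>>   -- point lookup on the primary key: no CPU increase
>>>>>   SELECT * FROM SOME_TABLE WHERE PK_COL = 'abc123';
>>>>> 
>>>>>   -- full scan, or filter on a non-primary-key column: a handler thread
>>>>>   -- goes RUNNABLE and the CPU stays high even after the client exits
>>>>>   SELECT COUNT(*) FROM SOME_TABLE;
>>>>>   SELECT * FROM SOME_TABLE WHERE NON_PK_COL = 'xyz';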
>>>>> 
>>>>> Disconnecting the client does not have any effect; the RpcServer.handler
>>>>> thread is left RUNNABLE and the CPU stays at the higher usage.
>>>>> 
>>>>> Checking the Web Console for the regionserver just shows 10
>>>>> RpcServer.reader tasks, all in a WAITING state; no other monitored tasks
>>>>> are happening. The regionserver has a Max Heap of 10G and a Used Heap of
>>>>> 445.2M.
>>>>> 
>>>>> I've attached the regionserver log with IPC debug logging turned on right
>>>>> when one of the Phoenix statements is executed (this statement actually
>>>>> used up the last available handler).
>>>>> 
>>>>> thanks,
>>>>> -chris
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzhong@hortonworks.com> wrote:
>>>>> 
>>>>>> 
>>>>>> From the stack, it seems you increased the default rpc handler number
>>>>>> to about 60. All handlers are serving Get requests (you can search for
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
>>>>>> 
>>>>>> You can check why there are so many Get requests by adding some log
>>>>>> info or enabling HBase RPC trace. I guess if you decrease the number of
>>>>>> rpc handlers per region server, it will mitigate your current issue.
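>>>>>> 
>>>>>> (A minimal sketch of turning on that trace logging; the logger name is
>>>>>> my assumption based on the org.apache.hadoop.hbase.ipc classes in your
>>>>>> stack traces, so adjust as needed. In the regionserver's
>>>>>> log4j.properties:
>>>>>> 
>>>>>>   log4j.logger.org.apache.hadoop.hbase.ipc=TRACE
>>>>>> 
>>>>>> then restart the regionserver, or change the level dynamically if your
>>>>>> build's web UI supports it.)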
>>>>>> 
>>>>>> 
>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:
>>>>>> 
>>>>>>> We have hit a problem with Phoenix where regionserver CPU usage spikes
>>>>>>> up to use all available CPU and the regionservers become unresponsive.
>>>>>>> 
>>>>>>> After HDP 2.1 was released we set up a 4 compute node cluster (with 3
>>>>>>> VMWare "master" nodes) to test out Phoenix on it. It is a plain Ambari
>>>>>>> 1.5/HDP 2.1 install; we added the HDP Phoenix RPM release and hand
>>>>>>> linked in the jar files to the hadoop lib. Everything was going well
>>>>>>> and we were able to load ~30k records into several tables. What
>>>>>>> happened was that after about 3-4 days of being up the regionservers
>>>>>>> became unresponsive and started to use most of the available CPU (12
>>>>>>> core boxes). Nothing terribly informative was in the logs (initially
>>>>>>> we saw some flush messages that seemed excessive, but that was not all
>>>>>>> of the time, and we changed back to the standard HBase WAL codec). We
>>>>>>> are able to kill the unresponsive regionservers and then restart them;
>>>>>>> the cluster will be fine for a day or so but will start to lock up
>>>>>>> again.
>>>>>>> 
>>>>>>> We've dropped the entire HBase and ZooKeeper information and started
>>>>>>> from scratch, but that has not helped.
>>>>>>> 
>>>>>>> James Taylor suggested I send this off here. I've attached a jstack
>>>>>>> report of a locked-up regionserver in hopes that someone can shed some
>>>>>>> light.
>>>>>>> 
>>>>>>> thanks,
>>>>>>> -chris
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>> 
>> 
>> 
> 

