phoenix-dev mailing list archives

From Jeffrey Zhong <jzh...@hortonworks.com>
Subject Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1
Date Tue, 20 May 2014 21:03:55 GMT

I rechecked your RPC log and found that you hadn't turned on the RPC trace
level. You can set the following so that we can see the detailed requests &
responses.

log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
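If that logger name has no effect on your build, you can also target the
RpcServer class that shows up in your stack traces directly (log4j logger
names are just class names), e.g.:

log4j.logger.org.apache.hadoop.hbase.ipc.RpcServer=TRACE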

If you didn't see a timeout exception in your region server logs, it means
that something keeps issuing lots of Get requests. Have you used a
secondary index on this table? If that's the case, please disable/drop it
to see if that helps.
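For example, from sqlline.py (the index and table names below are just
placeholders for your own):

ALTER INDEX IF EXISTS MY_IDX ON MY_TABLE DISABLE;
-- and, if disabling it makes the Get storm stop:
DROP INDEX IF EXISTS MY_IDX ON MY_TABLE;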




On 5/20/14 1:21 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:

>I will see how feasible a new, all-bare-metal cluster is, but we have 5
>other similarly configured clusters with no trouble. The only issue we see
>is that when Phoenix does a full scan of certain tables, a handler gets stuck in
>org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
>after it has returned all of its data.
>
>thank you,
>
>-chris
>
>On May 20, 2014, at 12:34 PM, alex kamil <alex.kamil@gmail.com> wrote:
>
>> i'm just guessing, but maybe the region servers are stuck because they can't
>> communicate with zookeeper or hmaster, or maybe the datanodes can't talk to
>> the namenode because of some vmware networking quirk and some internal thread
>> is hanging, who knows.
>> it would take 30 min to set up a bare metal cluster vs days debugging this
>> hybrid setup.
>> also I'd suggest installing directly from the hbase/hadoop websites to
>> eliminate additional variables
>> 
>> 
>> 
>> 
>>> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas <cft@biotiquesystems.com> wrote:
>> 
>>> Only the master nodes (HMaster, ZK and NN) are VMs. The
>>> datanodes/regionservers with the stuck processes are all bare metal.
>>> 
>>> -chris
>>> 
>>> On May 20, 2014, at 11:52 AM, alex kamil <alex.kamil@gmail.com> wrote:
>>> 
>>>>> "with 3 VMWare "master" nodes"
>>>> 
>>>> can you try running hbase/phoenix on physical nodes instead of using
>>>> virtual machines?
>>>> 
>>>> 
>>>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas <cft@biotiquesystems.com> wrote:
>>>> 
>>>>> Sorry to follow up on my own message, but I was wondering if anyone had
>>>>> any ideas? Normal non-Phoenix scans don't cause this symptom, but a
>>>>> select * on the exact same table right afterwards will.
>>>>> 
>>>>> If we export the table and then re-import it into a new table, the new
>>>>> table doesn't exhibit these symptoms, same as if we use an upsert..select
>>>>> to do a copy. It seems something happens to the last region to cause this,
>>>>> but it is not directly data dependent. Moving the region to another
>>>>> regionserver doesn't have any effect - it just moves where the problem
>>>>> happens. Major compactions get hung up by the running threads, as they
>>>>> probably hold a lock.
>>>>> 
>>>>> I've run the hfile tool on the final region and nothing seems awry.
>>>>> 
>>>>> Figuring this out will allow this project to continue; as of now it is
>>>>> hung up on this issue.
>>>>> 
>>>>> thank you,
>>>>> -chris
>>>>> 
>>>>> 
>>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas <cft@biotiquesystems.com> wrote:
>>>>> 
>>>>>> I did some "poor man's" profiling with multiple jstack dumps and came up
>>>>>> with where the RpcServer.handler appears to be stuck:
>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile.
>>>>>> That is the deepest method in all of the traces, either HStore.java:1712
>>>>>> or HStore.java:1722. Here are two example traces for a thread (which has
>>>>>> been running for the last couple of hours):
>>>>>> 
>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>>      at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNext(FastDiffDeltaEncoder.java:540)
>>>>>>      at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.next(BufferedDataBlockEncoder.java:261)
>>>>>>      at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1063)
>>>>>>      at org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStore.java:1776)
>>>>>>      at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1722)
>>>>>>      at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>>      at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>>      at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>>      at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>>      at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>      at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>      at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>>      at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>>      at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>>      at java.lang.Thread.run(Thread.java:744)
>>>>>> 
>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10 tid=0x0000000000cdb800 nid=0x727b runnable [0x00007f4b49e9e000]
>>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>>      at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1944)
>>>>>>      at org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
>>>>>>      at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.rootBlockContainingKey(HFileBlockIndex.java:392)
>>>>>>      at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:209)
>>>>>>      at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:179)
>>>>>>      at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:548)
>>>>>>      at org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(HStore.java:1712)
>>>>>>      at org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.java:1655)
>>>>>>      at org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion.java:1826)
>>>>>>      at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)
>>>>>>      at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28857)
>>>>>>      at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>>>>>>      at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>>>>>>      at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>>>>      at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>>>>      at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>>>>      at java.lang.Thread.run(Thread.java:744)
>>>>>> 
>>>>>> 
>>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzhong@hortonworks.com> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> Hey Chris,
>>>>>>> 
>>>>>>> I used the performance.py tool, which created a table with 50K rows in
>>>>>>> one table, ran the following query from sqlline.py, and everything seems
>>>>>>> fine without seeing the CPU running hot.
>>>>>>> 
>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>>>>> +------------+
>>>>>>> |  COUNT(1)  |
>>>>>>> +------------+
>>>>>>> | 50000      |
>>>>>>> +------------+
>>>>>>> 1 row selected (0.166 seconds)
>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
>>>>>>> +------------+
>>>>>>> |  COUNT(1)  |
>>>>>>> +------------+
>>>>>>> | 50000      |
>>>>>>> +------------+
>>>>>>> 1 row selected (0.167 seconds)
>>>>>>> 
>>>>>>> Is there any way you could run a profiler to see where the CPU goes?
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:
>>>>>>> 
>>>>>>>> Ahh, yes. Here is a pastebin for it:
>>>>>>>> 
>>>>>>>> http://pastebin.com/w6mtabag
>>>>>>>> 
>>>>>>>> thanks again,
>>>>>>>> -chris
>>>>>>>> 
>>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi Chris,
>>>>>>>>> 
>>>>>>>>> Attachments are filtered out by the mail server. Can you pastebin it
>>>>>>>>> some place?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Nick
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <cft@biotiquesystems.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Hello,
>>>>>>>>>> 
>>>>>>>>>> We set the HBase RegionServer handler count to 10 (it appears to have
>>>>>>>>>> been set to 60 by Ambari during the install process). Now we have
>>>>>>>>>> narrowed down what causes the CPU to increase and have some detailed
>>>>>>>>>> logs:
>>>>>>>>>> 
>>>>>>>>>> If we connect using sqlline.py and execute a select that selects one
>>>>>>>>>> row using the primary key, no increase in CPU is observed and the
>>>>>>>>>> number of RPC threads in a RUNNABLE state remains the same.
>>>>>>>>>> 
>>>>>>>>>> If we execute a select that scans the table, such as "select count(*)
>>>>>>>>>> from TABLE", or where the "where" clause only limits on non-primary key
>>>>>>>>>> attributes, then the number of RUNNABLE RpcServer.handler threads
>>>>>>>>>> increases and the CPU utilization of the regionserver increases by ~105%.
>>>>>>>>>> 
>>>>>>>>>> Disconnecting the client does not have an effect and the
>>>>>>>>>> RpcServer.handler thread is left RUNNABLE and the CPU stays at the
>>>>>>>>>> higher usage.
>>>>>>>>>> 
>>>>>>>>>> Checking the Web Console for the Regionserver just shows 10
>>>>>>>>>> RpcServer.reader tasks, all in a WAITING state, no other monitored
>>>>>>>>>> tasks are happening. The regionserver has a Max Heap of 10G and a Used
>>>>>>>>>> heap of 445.2M.
>>>>>>>>>> 
>>>>>>>>>> I've attached the regionserver log with IPC debug logging turned on
>>>>>>>>>> right when one of the Phoenix statements is executed (this statement
>>>>>>>>>> actually used up the last available handler).
>>>>>>>>>> 
>>>>>>>>>> thanks,
>>>>>>>>>> -chris
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzhong@hortonworks.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> From the stack, it seems you increased the default rpc handler number
>>>>>>>>>>> to about 60. All handlers are serving Get requests (you can search for
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
>>>>>>>>>>> 
>>>>>>>>>>> You can check why there are so many Get requests by adding some log
>>>>>>>>>>> info or enabling hbase rpc trace. I guess if you decrease the number of
>>>>>>>>>>> rpc handlers per region server, it will mitigate your current issue.
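>>>>>>>>>>> For example, a sketch of the hbase-site.xml change (30 is just an
>>>>>>>>>>> illustrative value, tune it for your load):
>>>>>>>>>>> 
>>>>>>>>>>>   <property>
>>>>>>>>>>>     <name>hbase.regionserver.handler.count</name>
>>>>>>>>>>>     <value>30</value>
>>>>>>>>>>>   </property>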
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> We have hit a problem with Phoenix: regionserver CPU usage spikes up
>>>>>>>>>>>> to use all available CPU and the regionservers become unresponsive.
>>>>>>>>>>>> 
>>>>>>>>>>>> After HDP 2.1 was released we set up a 4 compute node cluster (with 3
>>>>>>>>>>>> VMWare "master" nodes) to test out Phoenix on it. It is a plain Ambari
>>>>>>>>>>>> 1.5/HDP 2.1 install and we added the HDP Phoenix RPM release and
>>>>>>>>>>>> hand-linked the jar files into the hadoop lib. Everything was going
>>>>>>>>>>>> well and we were able to load in ~30k records into several tables.
>>>>>>>>>>>> What happened was that after about 3-4 days of being up the
>>>>>>>>>>>> regionservers became unresponsive and started to use most of the
>>>>>>>>>>>> available CPU (12 core boxes). Nothing terribly informative was in the
>>>>>>>>>>>> logs (initially we saw some flush messages that seemed excessive, but
>>>>>>>>>>>> that was not all of the time, and we changed back to the standard
>>>>>>>>>>>> HBase WAL codec). We are able to kill the unresponsive regionservers
>>>>>>>>>>>> and then restart them; the cluster will be fine for a day or so but
>>>>>>>>>>>> will start to lock up again.
>>>>>>>>>>>> 
>>>>>>>>>>>> We've dropped the entire HBase and zookeeper information and started
>>>>>>>>>>>> from scratch, but that has not helped.
>>>>>>>>>>>> 
>>>>>>>>>>>> James Taylor suggested I send this off here. I've attached a jstack
>>>>>>>>>>>> report of a locked up regionserver in hopes that someone can shed
>>>>>>>>>>>> some light.
>>>>>>>>>>>> 
>>>>>>>>>>>> thanks,
>>>>>>>>>>>> -chris
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>



