phoenix-dev mailing list archives

From Enis Soztutar <e...@hortonworks.com>
Subject Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1
Date Wed, 21 May 2014 22:58:00 GMT
Hey Chris,

Thanks for bearing with us. It seems that the FAST_DIFF encoding + your
data schema + use of getRowOrBefore() caused the issue. That is why Jeffrey
was asking for the data. It is fine. getRowOrBefore() will go away anyway (
https://issues.apache.org/jira/browse/HBASE-11230). Jeffrey's patch changes
Phoenix to use a "reverse scan", which should not have an issue with
FAST_DIFF. It would be awesome if you could test it with your data so that
we can see whether it is a problem with reverse lookups in FAST_DIFF
encoding, or a specific problem with getRowOrBefore().

Thanks for reporting.
Enis


On Wed, May 21, 2014 at 3:44 PM, Chris Tarnas <cft@biotiquesystems.com> wrote:

> I'll start on building a patched version and then trying it out (now that
> it is working I'll have to wrangle the cluster away from the users). I
> cannot send the data; it is private data.
>
> thanks,
> -chris
>
>
> On May 21, 2014, at 3:04 PM, Jeffrey Zhong <jzhong@hortonworks.com> wrote:
>
> >
> > Thanks for reporting the status back. Would you mind trying the patch on
> > phoenix-992, or attaching data which can repro the issue, so that we can
> > verify the current fix? Thanks.
> >
> > On 5/21/14 2:06 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:
> >
> >>
> >> Altering the table to not use a DATA_BLOCK_ENCODING fixed the problem.
> >> Just to verify that it was the fix, I altered it back to FAST_DIFF and
> >> the problem re-appeared. Thank you!
> >>
> >> Adding the changes to the client hbase-site.xml did not help the issue,
> >> but it may be that I was not sourcing it correctly in sqlline. I made
> >> sure it was in the classpath being passed to java via -cp when launching
> >> it.
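> >>
> >> For reference, the kind of launch I mean, as a sketch with placeholder
> >> paths and ZK host rather than the exact command, is:
> >>
> >>   java -cp "/etc/hbase/conf:/path/to/phoenix-client.jar" \
> >>     sqlline.SqlLine -d org.apache.phoenix.jdbc.PhoenixDriver \
> >>     -u jdbc:phoenix:zk-host
> >>
> >> with /etc/hbase/conf on the classpath so the client hbase-site.xml is
> >> picked up.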
> >>
> >> thanks again,
> >> -chris
> >>
> >> On May 20, 2014, at 5:50 PM, Jeffrey Zhong <jzhong@hortonworks.com> wrote:
> >>
> >>>
> >>>
> >>> Hey Chris,
> >>>
> >>> Sorry about the trace option. The right option is below. Where you
> >>> previously turned on DEBUG, you just need to change it to TRACE:
> >>>
> >>>   log4j.logger.org.apache.hadoop.ipc.RpcServer=TRACE
> >>>
> >>>
> >>>
> >>> Do you see that the two read requests burn one of your CPUs to 100%?
> >>> From the call stacks you posted today and before, it seems there might
> >>> be an infinite loop inside the function HTable.getRowOrBefore(), which
> >>> is deprecated and supposed to work only for the meta region (as Enis
> >>> pointed out). Fortunately there is only one place in Phoenix that calls
> >>> this function, so could you try the following things?
> >>>
> >>> Set the following configuration in the client-side hbase-site.xml to
> >>> disable stats updates:
> >>>
> >>>  phoenix.query.statsUpdateFrequency = 2147483647
> >>>  phoenix.query.maxStatsAge = 0
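> >>>
> >>> In hbase-site.xml form that would be, as a sketch of the standard Hadoop
> >>> config format:
> >>>
> >>>   <property>
> >>>     <name>phoenix.query.statsUpdateFrequency</name>
> >>>     <value>2147483647</value>
> >>>   </property>
> >>>   <property>
> >>>     <name>phoenix.query.maxStatsAge</name>
> >>>     <value>0</value>
> >>>   </property>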
> >>>
> >>> After the configuration changes, could you verify if the CPU is still
> >>> spiking for the same query?
> >>>
> >>> You can also do the following:
> >>>
> >>> Alter your table's ColumnFamily data block encoding to NONE, as
> >>> DATA_BLOCK_ENCODING => 'NONE', and do a major compaction after that
> >>> through the HBase client.
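> >>>
> >>> From the HBase shell that would look something like the following, with
> >>> 'T' and the family name '0' as placeholders for your table:
> >>>
> >>>   alter 'T', {NAME => '0', DATA_BLOCK_ENCODING => 'NONE'}
> >>>   major_compact 'T'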
> >>>
> >>>
> >>> We suspect that the combination of your data, FAST_DIFF data block
> >>> encoding & the deprecated HTable.getRowOrBefore() causes the issue.
> >>>
> >>>
> >>>
> >>> On 5/20/14 3:03 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:
> >>>
> >>>> Hi Jeffrey.
> >>>>
> >>>> Thank you for taking a look. I added
> >>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE to the Custom
> >>>> log4j.properties in Ambari and did a rolling restart of the
> >>>> regionservers (and verified it was pushed out to the nodes in
> >>>> /etc/hbase/conf/log4j.properties), then issued a "select count(*)" on
> >>>> the table. Looking at the logs I don't see any TRACE messages. The only
> >>>> two log messages that happened when issuing the select are:
> >>>>
> >>>> 2014-05-20 16:43:53,609 INFO  [RpcServer.handler=9,port=60020]
> >>>> coprocessor.UngroupedAggregateRegionObserver: Starting ungrouped
> >>>> coprocessor scan
> >>>> {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
> >>>> 2014-05-20 16:43:54,309 INFO  [RpcServer.handler=9,port=60020]
> >>>> coprocessor.UngroupedAggregateRegionObserver: Finished scanning 18608
> >>>> rows for ungrouped coprocessor scan
> >>>> {"timeRange":[0,1400622233277],"batch":-1,"startRow":"\\x03\\x00\\x00\\x00\\x00","stopRow":"","loadColumnFamiliesOnDemand":true,"totalColumns":1,"cacheBlocks":true,"families":{"0":["ALL"]},"maxResultSize":-1,"maxVersions":1,"filter":"FirstKeyOnlyFilter","caching":-1}
> >>>>
> >>>> The metrics for the regionserver show only 2 total read requests since
> >>>> it started and 0 per second now. Those two requests are for the table I
> >>>> did the count on. The regionserver's UI says it has no active RPC calls.
> >>>> It currently has one stuck handler from running that command:
> >>>>
> >>>> "RpcServer.handler=8,port=60020" daemon prio=10 tid=0x0000000001c08000
> >>>> nid=0x2582 runnable [0x00007fbd5ba6c000]
> >>>> java.lang.Thread.State: RUNNABLE
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java
> >>>> :2
> >>>> 888)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.KeyValue$RowOnlyComparator.compare(KeyValue.java
> >>>> :2
> >>>> 880)
> >>>>    at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:369)
> >>>>    at java.util.TreeMap.getEntry(TreeMap.java:340)
> >>>>    at java.util.TreeMap.get(TreeMap.java:273)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.isDelete
> >>>> d(
> >>>> GetClosestRowBeforeTracker.java:128)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.addCandi
> >>>> da
> >>>> te(GetClosestRowBeforeTracker.java:107)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.regionserver.GetClosestRowBeforeTracker.handle(G
> >>>> et
> >>>> ClosestRowBeforeTracker.java:202)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(HStor
> >>>> e.
> >>>> java:1772)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile(H
> >>>> St
> >>>> ore.java:1722)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HStore.j
> >>>> av
> >>>> a:1655)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRegion
> >>>> .j
> >>>> ava:1826)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.jav
> >>>> a:
> >>>> 2841)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.
> >>>> ca
> >>>> llBlockingMethod(ClientProtos.java:28857)
> >>>>    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> >>>>    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcSch
> >>>> ed
> >>>> uler.java:160)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcSched
> >>>> ul
> >>>> er.java:38)
> >>>>    at
> >>>>
> >>>>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.
> >>>> ja
> >>>> va:110)
> >>>>    at java.lang.Thread.run(Thread.java:744)
> >>>>
> >>>> It also has one listener running:
> >>>>
> >>>> "RpcServer.listener,port=60020" daemon prio=10 tid=0x0000000001db8000
> >>>> nid=0x2579 runnable [0x00007fbd5c376000]
> >>>> java.lang.Thread.State: RUNNABLE
> >>>>    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> >>>>    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
> >>>>    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
> >>>>    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
> >>>>    - locked <0x000000059ae28988> (a sun.nio.ch.Util$2)
> >>>>    - locked <0x000000059ae289a0> (a
> >>>> java.util.Collections$UnmodifiableSet)
> >>>>    - locked <0x000000059ae294b8> (a sun.nio.ch.EPollSelectorImpl)
> >>>>    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
> >>>>    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
> >>>>    at
> >>>> org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:676)
> >>>>
> >>>> If there is anything else you need, please let me know.
> >>>>
> >>>> thanks again,
> >>>> -chris
> >>>>
> >>>>
> >>>> On May 20, 2014, at 2:03 PM, Jeffrey Zhong <jzhong@hortonworks.com>
> >>>> wrote:
> >>>>
> >>>>>
> >>>>> I rechecked your RPC log and found that you didn't turn on the RPC
> >>>>> trace level. You can set the following so that we can see the detailed
> >>>>> request & response:
> >>>>>
> >>>>> log4j.logger.org.apache.hadoop.ipc.HBaseServer.trace=TRACE
> >>>>>
> >>>>> If you didn't see a timeout exception in your region server logs, it
> >>>>> means that something keeps issuing lots of Get requests. Have you used
> >>>>> a secondary index for this table? If so, please disable/drop it to see
> >>>>> if that helps.
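> >>>>>
> >>>>> For reference, disabling or dropping a Phoenix secondary index looks
> >>>>> something like this, with IDX and T as placeholder names:
> >>>>>
> >>>>>   ALTER INDEX IDX ON T DISABLE;
> >>>>>   DROP INDEX IDX ON T;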
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 5/20/14 1:21 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:
> >>>>>
> >>>>>> I will see how possible a new, all bare metal cluster is, but we have
> >>>>>> 5 other similarly configured clusters with no troubles. We only have
> >>>>>> an issue when Phoenix does a full scan of some tables: a handler gets
> >>>>>> stuck in
> >>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile
> >>>>>> after it has returned all of its data.
> >>>>>>
> >>>>>> thank you,
> >>>>>>
> >>>>>> -chris
> >>>>>>
> >>>>>> On May 20, 2014, at 12:34 PM, alex kamil <alex.kamil@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> I'm just guessing, but maybe the region servers are stuck because
> >>>>>>> they can't communicate with zookeeper or hmaster, or maybe the
> >>>>>>> datanodes can't talk to the namenode because of some vmware
> >>>>>>> networking quirk and some internal thread is hanging, who knows. It
> >>>>>>> would take 30 min to set up a bare metal cluster vs. days debugging
> >>>>>>> this hybrid setup. I'd also suggest installing directly from the
> >>>>>>> hbase/hadoop websites to eliminate additional variables.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, May 20, 2014 at 3:14 PM, Chris Tarnas
> >>>>>>> <cft@biotiquesystems.com> wrote:
> >>>>>>>
> >>>>>>>> Only the master nodes (HMaster, ZK and NN) are VMs. The
> >>>>>>>> datanodes/regionservers with the stuck processes are all bare metal.
> >>>>>>>>
> >>>>>>>> -chris
> >>>>>>>>
> >>>>>>>> On May 20, 2014, at 11:52 AM, alex kamil <alex.kamil@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>>> "with 3 VMWare "master" nodes"
> >>>>>>>>>
> >>>>>>>>> can you try running hbase/phoenix on physical nodes instead of
> >>>>>>>>> using virtual machines?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, May 20, 2014 at 2:24 PM, Chris Tarnas
> >>>>>>>>> <cft@biotiquesystems.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Sorry to follow up on my own message, but I was wondering if
> >>>>>>>>>> anyone had any ideas? Normal non-Phoenix scans don't cause this
> >>>>>>>>>> symptom, but a select * on the exact same table right after will.
> >>>>>>>>>>
> >>>>>>>>>> If we export the table and then re-import it into a new table, the
> >>>>>>>>>> new table doesn't exhibit these symptoms, same as if we use an
> >>>>>>>>>> upsert..select to do a copy. It seems something happens to the last
> >>>>>>>>>> region to cause this, but it is not directly data dependent. Moving
> >>>>>>>>>> the region to another regionserver doesn't have any effect - just
> >>>>>>>>>> moves where the problem happens. Major compactions get hung up by
> >>>>>>>>>> the running threads as they probably have a lock.
> >>>>>>>>>>
> >>>>>>>>>> I've run the hfile tool on the final region and nothing seems
> >>>>>>>>>> awry.
> >>>>>>>>>>
> >>>>>>>>>> Figuring this out will allow this project to continue; as of now
> >>>>>>>>>> it is hung up on this issue.
> >>>>>>>>>>
> >>>>>>>>>> thank you,
> >>>>>>>>>> -chris
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On May 15, 2014, at 8:47 PM, Chris Tarnas <cft@biotiquesystems.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I did some "poor man's" profiling with multiple jstacks and came
> >>>>>>>>>>> up with where the RpcServer.handler appears to be stuck:
> >>>>>>>>>>
> >>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFile.
> >>>>>>>>>>> That is the last, deepest method in all of the traces, either
> >>>>>>>>>>> HStore.java:1712 or HStore.java:1722. Here are two example traces
> >>>>>>>>>>> for a thread (which has been running for the last couple hours):
> >>>>>>>>>>>
> >>>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
> >>>>>>>>>>> tid=0x0000000000cdb800
> >>>>>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
> >>>>>>>>>>> java.lang.Thread.State: RUNNABLE
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder$1.decodeNex
> >>>>>>>> t(
> >>>>>>>> Fa
> >>>>>>>> stDiffDeltaEncoder.java:540)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$Buffere
> >>>>>>>> dE
> >>>>>>>> nc
> >>>>>>>> odedSeeker.next(BufferedDataBlockEncoder.java:261)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next
> >>>>>>>> (H
> >>>>>>>> Fi
> >>>>>>>> leReaderV2.java:1063)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.regionserver.HStore.walkForwardInSingleRow(H
> >>>>>>>> St
> >>>>>>>> or
> >>>>>>>> e.java:1776)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFi
> >>>>>>>> le
> >>>>>>>> (H
> >>>>>>>> Store.java:1722)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HSto
> >>>>>>>> re
> >>>>>>>> .j
> >>>>>>>> ava:1655)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRe
> >>>>>>>> gi
> >>>>>>>> on
> >>>>>>>> .java:1826)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer
> >>>>>>>> .j
> >>>>>>>> av
> >>>>>>>> a:2841)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientServic
> >>>>>>>> e$
> >>>>>>>> 2.
> >>>>>>>> callBlockingMethod(ClientProtos.java:28857)
> >>>>>>>>>>>  at
> >>>>>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> >>>>>>>>>>>  at
> >>>>>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRp
> >>>>>>>> cS
> >>>>>>>> ch
> >>>>>>>> eduler.java:160)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcS
> >>>>>>>> ch
> >>>>>>>> ed
> >>>>>>>> uler.java:38)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcSchedu
> >>>>>>>> le
> >>>>>>>> r.
> >>>>>>>> java:110)
> >>>>>>>>>>>  at java.lang.Thread.run(Thread.java:744)
> >>>>>>>>>>>
> >>>>>>>>>>> "RpcServer.handler=1,port=60020" daemon prio=10
> >>>>>>>>>>> tid=0x0000000000cdb800
> >>>>>>>>>> nid=0x727b runnable [0x00007f4b49e9e000]
> >>>>>>>>>>> java.lang.Thread.State: RUNNABLE
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:
> >>>>>>>> 19
> >>>>>>>> 44
> >>>>>>>> )
> >>>>>>>>>>>  at
> >>>>>>>> org.apache.hadoop.hbase.util.Bytes.binarySearch(Bytes.java:1622)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.ro
> >>>>>>>> ot
> >>>>>>>> Bl
> >>>>>>>> ockContainingKey(HFileBlockIndex.java:392)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.lo
> >>>>>>>> ad
> >>>>>>>> Da
> >>>>>>>> taBlockWithScanInfo(HFileBlockIndex.java:209)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.se
> >>>>>>>> ek
> >>>>>>>> To
> >>>>>>>> DataBlock(HFileBlockIndex.java:179)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.see
> >>>>>>>> kB
> >>>>>>>> ef
> >>>>>>>> ore(HFileReaderV2.java:548)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.regionserver.HStore.rowAtOrBeforeFromStoreFi
> >>>>>>>> le
> >>>>>>>> (H
> >>>>>>>> Store.java:1712)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.regionserver.HStore.getRowKeyAtOrBefore(HSto
> >>>>>>>> re
> >>>>>>>> .j
> >>>>>>>> ava:1655)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.regionserver.HRegion.getClosestRowBefore(HRe
> >>>>>>>> gi
> >>>>>>>> on
> >>>>>>>> .java:1826)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer
> >>>>>>>> .j
> >>>>>>>> av
> >>>>>>>> a:2841)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientServic
> >>>>>>>> e$
> >>>>>>>> 2.
> >>>>>>>> callBlockingMethod(ClientProtos.java:28857)
> >>>>>>>>>>>  at
> >>>>>>>>>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> >>>>>>>>>>>  at
> >>>>>>>>>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRp
> >>>>>>>> cS
> >>>>>>>> ch
> >>>>>>>> eduler.java:160)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcS
> >>>>>>>> ch
> >>>>>>>> ed
> >>>>>>>> uler.java:38)
> >>>>>>>>>>>  at
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcSchedu
> >>>>>>>> le
> >>>>>>>> r.
> >>>>>>>> java:110)
> >>>>>>>>>>>  at java.lang.Thread.run(Thread.java:744)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On May 14, 2014, at 5:54 PM, Jeffrey Zhong <jzhong@hortonworks.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hey Chris,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I used the performance.py tool, which created a table with 50K
> >>>>>>>>>>>> rows, ran the following query from sqlline.py, and everything
> >>>>>>>>>>>> seems fine without seeing the CPU running hot.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*)
> >>>>>>>>>>>> from
> >>>>>>>>>>>> PERFORMANCE_50000;
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> |  COUNT(1)  |
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> | 50000      |
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> 1 row selected (0.166 seconds)
> >>>>>>>>>>>> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*)
> >>>>>>>>>>>> from
> >>>>>>>>>>>> PERFORMANCE_50000;
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> |  COUNT(1)  |
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> | 50000      |
> >>>>>>>>>>>> +------------+
> >>>>>>>>>>>> 1 row selected (0.167 seconds)
> >>>>>>>>>>>>
> >>>>>>>>>>>> Is there any way you could run a profiler to see where the CPU
> >>>>>>>>>>>> goes?
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 5/13/14 6:40 PM, "Chris Tarnas" <cft@biotiquesystems.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Ahh, yes. Here is a pastebin for it:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> http://pastebin.com/w6mtabag
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> thanks again,
> >>>>>>>>>>>>> -chris
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Chris,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Attachments are filtered out by the mail server. Can you
> >>>>>>>>>>>>>> pastebin it
> >>>>>>>>>>>>>> some
> >>>>>>>>>>>>>> place?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> Nick
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas
> >>>>>>>>>>>>>> <cft@biotiquesystems.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We set the HBase RegionServer Handler to 10 (it appears to have
> >>>>>>>>>>>>>>> been set to 60 by Ambari during the install process). Now we
> >>>>>>>>>>>>>>> have narrowed down what causes the CPU to increase and have
> >>>>>>>>>>>>>>> some detailed logs:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If we connect using sqlline.py and execute a select that
> >>>>>>>>>>>>>>> selects one row using the primary key, no increase in CPU is
> >>>>>>>>>>>>>>> observed and the number of RPC threads in a RUNNABLE state
> >>>>>>>>>>>>>>> remains the same.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If we execute a select that scans the table, such as "select
> >>>>>>>>>>>>>>> count(*) from TABLE", or one whose "where" clause only limits
> >>>>>>>>>>>>>>> on non-primary key attributes, then the number of RUNNABLE
> >>>>>>>>>>>>>>> RpcServer.handler threads increases and the CPU utilization of
> >>>>>>>>>>>>>>> the regionserver increases by ~105%.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Disconnecting the client does not have an effect and the
> >>>>>>>>>>>>>>> RpcServer.handler
> >>>>>>>>>>>>>>> thread is left RUNNABLE and the CPU stays at the higher
> >>>>>>>>>>>>>>> usage.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Checking the Web Console for the Regionserver just shows 10
> >>>>>>>>>>>>>>> RpcServer.reader tasks, all in a WAITING state; no other
> >>>>>>>>>>>>>>> monitored tasks are happening. The regionserver has a Max Heap
> >>>>>>>>>>>>>>> of 10G and a Used heap of 445.2M.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I've attached the regionserver log with IPC debug logging
> >>>>>>>>>>>>>>> turned on right when one of the Phoenix statements is executed
> >>>>>>>>>>>>>>> (this statement actually used up the last available handler).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> thanks,
> >>>>>>>>>>>>>>> -chris
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong
> >>>>>>>>>>>>>>> <jzhong@hortonworks.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> From the stack, it seems you increased the default rpc handler
> >>>>>>>>>>>>>>>> number to about 60. All handlers are serving Get requests (you
> >>>>>>>>>>>>>>>> can search for
> >>>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> You can check why there are so many Get requests by adding
> >>>>>>>>>>>>>>>> some log info or enabling the hbase rpc trace. I guess if you
> >>>>>>>>>>>>>>>> decrease the number of rpc handlers per region server, it will
> >>>>>>>>>>>>>>>> mitigate your current issue.
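> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> The handler count is hbase.regionserver.handler.count in
> >>>>>>>>>>>>>>>> hbase-site.xml; as a sketch, with 10 as an illustrative value:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>   <property>
> >>>>>>>>>>>>>>>>     <name>hbase.regionserver.handler.count</name>
> >>>>>>>>>>>>>>>>     <value>10</value>
> >>>>>>>>>>>>>>>>   </property>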
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> We have hit a problem with Phoenix and regionservers: CPU
> >>>>>>>>>>>>>>>>> usage spiking up to use all available CPU and becoming
> >>>>>>>>>>>>>>>>> unresponsive.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> After HDP 2.1 was released we set up a 4 compute node cluster
> >>>>>>>>>>>>>>>>> (with 3 VMWare "master" nodes) to test out Phoenix on it. It
> >>>>>>>>>>>>>>>>> is a plain Ambari 1.5/HDP 2.1 install and we added the HDP
> >>>>>>>>>>>>>>>>> Phoenix RPM release and hand linked in the jar files to the
> >>>>>>>>>>>>>>>>> hadoop lib. Everything was going well and we were able to
> >>>>>>>>>>>>>>>>> load in ~30k records into several tables. What happened was
> >>>>>>>>>>>>>>>>> after about 3-4 days of being up the regionservers became
> >>>>>>>>>>>>>>>>> unresponsive and started to use most of the available CPU
> >>>>>>>>>>>>>>>>> (12 core boxes). Nothing terribly informative was in the logs
> >>>>>>>>>>>>>>>>> (initially we saw some flush messages that seemed excessive,
> >>>>>>>>>>>>>>>>> but that was not all of the time and we changed back to the
> >>>>>>>>>>>>>>>>> standard HBase WAL codec). We are able to kill the
> >>>>>>>>>>>>>>>>> unresponsive regionservers and then restart them; the cluster
> >>>>>>>>>>>>>>>>> will be fine for a day or so but will start to lock up again.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> We've dropped the entire HBase and ZooKeeper information and
> >>>>>>>>>>>>>>>>> started from scratch, but that has not helped.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> James Taylor suggested I send this off here. I've attached a
> >>>>>>>>>>>>>>>>> jstack report of a locked up regionserver in hopes that
> >>>>>>>>>>>>>>>>> someone can shed some light.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> thanks,
> >>>>>>>>>>>>>>>>> -chris
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
>


