phoenix-dev mailing list archives

From Christopher Tarnas <...@biotiquesystems.com>
Subject Re: Regionserver burns CPU and stops responding to RPC calls on HDP 2.1
Date Thu, 15 May 2014 15:53:50 GMT
Hi Jeffrey,

The performance.py from HDP 2.1 does not work out of the box: it does not
look in the right place for the Phoenix jar files (even with
PHOENIX_LIB_DIR set). I downloaded the tarball release from Apache and that
one works. After running the script and the counts as you suggested, I saw
no increase in CPU usage and no stuck RpcServer.handler threads. I ran a
few scans as well, just for good measure. I used the HDP sqlline.py to do
the queries.
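
For reference, the tarball invocation that worked for me looked like this
(the ZooKeeper quorum is from my setup; the row count matches the
PERFORMANCE_50000 table in your transcript below):

  bin/performance.py localhost:2181 50000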

Re-running the count on one of our tables, I did see the problem, and can
add one more bit of info: the only affected regionserver is the one that
hosts the last region of the scanned table, and I cannot reproduce it
perfectly; it seems to happen about 80% of the time. Also, a stuck
RpcServer.handler thread prevents the regionserver from exiting; it needs
to be killed with -9.
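
In case it's useful, here is a rough Python sketch of how I am counting the
stuck handlers: run jstack against the regionserver pid and tally the
RpcServer.handler threads reported RUNNABLE (thread names as they appear
in the jstack dumps):

  import subprocess
  import sys

  def runnable_handlers(pid):
      # One jstack dump; threads are blank-line-separated blocks.
      dump = subprocess.check_output(["jstack", str(pid)]).decode()
      return sum(1 for block in dump.split("\n\n")
                 if "RpcServer.handler" in block and "RUNNABLE" in block)

  if __name__ == "__main__":
      print(runnable_handlers(sys.argv[1]))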

I have one regionserver running with a few stuck handlers; are there any
commands you would like me to run on it?

thank you,
-chris


On Wed, May 14, 2014 at 5:54 PM, Jeffrey Zhong <jzhong@hortonworks.com>
wrote:

>
> Hey Chris,
>
> I used the performance.py tool, which created a table with 50K rows, ran
> the following query from sqlline.py, and everything seems fine; the CPU
> is not running hot.
>
> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
> +------------+
> |  COUNT(1)  |
> +------------+
> | 50000      |
> +------------+
> 1 row selected (0.166 seconds)
> 0: jdbc:phoenix:hor11n21.gq1.ygridcore.net> select count(*) from PERFORMANCE_50000;
> +------------+
> |  COUNT(1)  |
> +------------+
> | 50000      |
> +------------+
> 1 row selected (0.167 seconds)
>
> Is there any way you could run a profiler to see where the CPU goes?
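>
> Even a poor man's profiler would help: sample jstack once a second for a
> while and tally the top frame of each RUNNABLE thread; whatever is
> burning the CPU should dominate the tally. A minimal sketch (pid is the
> regionserver process; the sample count is arbitrary):
>
>   import collections
>   import subprocess
>   import sys
>   import time
>
>   def sample(pid, samples=30):
>       frames = collections.Counter()
>       for _ in range(samples):
>           dump = subprocess.check_output(["jstack", str(pid)]).decode()
>           for block in dump.split("\n\n"):
>               if "RUNNABLE" not in block:
>                   continue
>               for line in block.splitlines():
>                   line = line.strip()
>                   if line.startswith("at "):  # top frame of this thread
>                       frames[line] += 1
>                       break
>           time.sleep(1)
>       for frame, count in frames.most_common(10):
>           print("%6d %s" % (count, frame))
>
>   if __name__ == "__main__":
>       sample(sys.argv[1])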
>
>
>
> On 5/13/14 6:40 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:
>
> >Ahh, yes. Here is a pastebin for it:
> >
> >http://pastebin.com/w6mtabag
> >
> >thanks again,
> >-chris
> >
> >On May 13, 2014, at 7:47 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
> >
> >> Hi Chris,
> >>
> >> Attachments are filtered out by the mail server. Can you pastebin it
> >> someplace?
> >>
> >> Thanks,
> >> Nick
> >>
> >>
> >> On Tue, May 13, 2014 at 2:56 PM, Chris Tarnas <cft@biotiquesystems.com>
> >> wrote:
> >>
> >>> Hello,
> >>>
> >>> We set the HBase RegionServer handler count to 10 (it appears to have
> >>> been set to 60 by Ambari during the install process). Now we have
> >>> narrowed down what causes the CPU to increase, and we have some
> >>> detailed logs (below).
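> >>>
> >>> (For anyone wanting to reproduce this: that setting is
> >>> hbase.regionserver.handler.count in hbase-site.xml.)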
> >>>
> >>> If we connect using sqlline.py and execute a select that fetches one
> >>> row by primary key, no increase in CPU is observed and the number of
> >>> RPC threads in a RUNNABLE state remains the same.
> >>>
> >>> If we execute a select that scans the table, such as "select count(*)
> >>> from TABLE" or one whose "where" clause only filters on non-primary-key
> >>> attributes, then the number of RUNNABLE RpcServer.handler threads
> >>> increases and the CPU utilization of the regionserver rises by ~105%.
> >>>
> >>> Disconnecting the client has no effect: the RpcServer.handler thread
> >>> is left RUNNABLE and the CPU stays at the higher usage.
> >>>
> >>> Checking the web console for the regionserver just shows 10
> >>> RpcServer.reader tasks, all in a WAITING state; no other monitored
> >>> tasks are running. The regionserver has a max heap of 10G and a used
> >>> heap of 445.2M.
> >>>
> >>> I've attached the regionserver log with IPC debug logging turned on
> >>> right when one of the Phoenix statements is executed (this statement
> >>> actually used up the last available handler).
> >>>
> >>> thanks,
> >>> -chris
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On May 12, 2014, at 5:32 PM, Jeffrey Zhong <jzhong@hortonworks.com>
> >>> wrote:
> >>>
> >>>>
> >>>> From the stack, it seems you increased the default rpc handler number
> >>>> to about 60. All handlers are serving Get requests (you can search for
> >>>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2841)).
> >>>>
> >>>> You can check why there are so many Get requests by adding some log
> >>>> info or enabling the HBase RPC trace. I suspect that decreasing the
> >>>> number of rpc handlers per regionserver will mitigate your current
> >>>> issue.
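> >>>>
> >>>> For the trace, a one-line log4j.properties change should do it; the
> >>>> exact logger name here is my assumption for this HBase version:
> >>>>
> >>>>   log4j.logger.org.apache.hadoop.hbase.ipc=TRACE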
> >>>>
> >>>>
> >>>> On 5/12/14 2:28 PM, "Chris Tarnas" <cft@biotiquesystems.com> wrote:
> >>>>
> >>>>> We have hit a problem with Phoenix where regionserver CPU usage spikes
> >>>>> until all available CPU is consumed and the regionservers stop
> >>>>> responding.
> >>>>>
> >>>>> After HDP 2.1 was released we set up a 4 compute node cluster (with 3
> >>>>> VMWare "master" nodes) to test out Phoenix. It is a plain Ambari
> >>>>> 1.5/HDP 2.1 install; we added the HDP Phoenix RPM release and hand
> >>>>> linked the jar files into the hadoop lib. Everything was going well
> >>>>> and we were able to load ~30k records into several tables. After about
> >>>>> 3-4 days of uptime, the regionservers became unresponsive and started
> >>>>> to use most of the available CPU (12 core boxes). Nothing terribly
> >>>>> informative was in the logs (initially we saw some flush messages that
> >>>>> seemed excessive, but that was not all of the time, and we changed
> >>>>> back to the standard HBase WAL codec). We are able to kill the
> >>>>> unresponsive regionservers and restart them; the cluster will be fine
> >>>>> for a day or so but then starts to lock up again.
> >>>>>
> >>>>> We've dropped the entire HBase and ZooKeeper data and started from
> >>>>> scratch, but that has not helped.
> >>>>>
> >>>>> James Taylor suggested I send this off here. I've attached a jstack
> >>>>> report of a locked up regionserver in hopes that someone can shed some
> >>>>> light.
> >>>>>
> >>>>> thanks,
> >>>>> -chris
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
