kafka-users mailing list archives

From Mathias Söderberg <mathias.soederb...@gmail.com>
Subject Re: Increased CPU usage with 0.8.2-beta
Date Mon, 02 Feb 2015 22:12:42 GMT
Jun,

Yeah, sure, I'll take it for a spin tomorrow.

On Mon Feb 02 2015 at 11:08:42 PM Jun Rao <jun@confluent.io> wrote:

> Mathias,
>
> Thanks for the info. I took a quick look. The biggest difference I saw is
> the org.xerial.snappy.SnappyNative.rawCompress() call. In 0.8.1.1, it uses
> about 0.05% of the CPU. In 0.8.2.0, it uses about 0.10% of the CPU. We did
> upgrade snappy from 1.0.5 in 0.8.1.1 to 1.1.1.6 in 0.8.2.0. Could you try
> to use the same version of snappy in 0.8.1.1 and 0.8.2.0 and rerun your
> tests?
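>
> If it helps isolate things, a micro-benchmark along these lines (a rough
> sketch; the class name and sizes here are made up), run once against each
> snappy-java jar, would show whether the compression path itself got
> slower between the two versions:
>
> import org.xerial.snappy.Snappy;
>
> public class SnappyCompressBench {
>     public static void main(String[] args) throws Exception {
>         byte[] input = new byte[64 * 1024];
>         new java.util.Random(42).nextBytes(input); // fixed seed; incompressible worst case
>         long start = System.nanoTime();
>         for (int i = 0; i < 100000; i++) {
>             Snappy.compress(input); // same public API in 1.0.5 and 1.1.1.6
>         }
>         System.out.printf("%.1f ms%n", (System.nanoTime() - start) / 1e6);
>     }
> }
>
> i.e. compile and run it with snappy-java-1.0.5.jar on the classpath, then
> again with snappy-java-1.1.1.6.jar, and compare the timings.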
>
> Jun
>
> On Mon, Feb 2, 2015 at 1:15 PM, Mathias Söderberg <
> mathias.soederberg@gmail.com> wrote:
>
> > Hi all,
> >
> > I ran the same hprof test on 0.8.1.1, and also did a re-run on
> > 0.8.2.0-rc2; logs from both runs are attached. Both runs lasted for 30-40
> > minutes. The configurations used can be found over here:
> > https://gist.github.com/mthssdrbrg/5fcb9fbdb851d8cc66a2. The
> > configuration used for the first run (on 0.8.2-beta) can be found here:
> > https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> >
> > The cluster is running on EC2, currently on four m1.xlarge instances,
> > with a ZooKeeper cluster consisting of 3 nodes. There are 1248 topics,
> > and a total of 4143 partitions, with a replication factor of 3.
> >
> > Let me know if there's any other information I can provide.
> >
> > Best regards,
> > Mathias
> >
> > On Mon Feb 02 2015 at 6:50:54 PM Jay Kreps <jay.kreps@gmail.com> wrote:
> >
> >> Looking at that profile:
> >>
> >> Misc. blocking socket activity, not actual CPU work:
> >>    1  5.24%  5.24%   67781 300447 java.net.PlainSocketImpl.socketAccept
> >>    2  5.24% 10.49%   67775 300515 java.net.PlainSocketImpl.socketAccept
> >>    3  5.24% 15.73%   67773 300567 java.net.PlainSocketImpl.socketAccept
> >>    4  5.24% 20.97%   67683 301396 sun.nio.ch.ServerSocketChannelImpl.accept0
> >>    5  5.23% 26.20%   67582 301395 sun.nio.ch.EPollArrayWrapper.epollWait
> >>    6  5.23% 31.42%   67529 301519 sun.nio.ch.EPollArrayWrapper.epollWait
> >>    7  5.13% 36.55%   66297 302447 sun.nio.ch.EPollArrayWrapper.epollWait
> >>    8  3.95% 40.51%   51071 302446 sun.nio.ch.EPollArrayWrapper.epollWait
> >>    9  3.65% 44.16%   47234 303479 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   10  3.64% 47.80%   47019 302444 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   11  3.62% 51.42%   46768 302445 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   12  3.27% 54.69%   42237 303475 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   13  3.16% 57.85%   40892 303476 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   14  3.14% 60.99%   40556 303478 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   15  3.05% 64.04%   39428 303480 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   16  2.68% 66.72%   34673 303477 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   17  1.23% 67.95%   15867 303520 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   18  0.98% 68.93%   12663 303541 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   19  0.92% 69.85%   11920 303536 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   20  0.85% 70.70%   11015 303546 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   21  0.82% 71.53%   10625 303534 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   22  0.69% 72.21%    8866 303512 sun.nio.ch.EPollArrayWrapper.epollWait
> >>
> >>   TransferTo used to write data to the socket, not actual CPU work (mostly)
> >>   23  0.36% 72.57%    4592 304991 sun.nio.ch.FileDispatcherImpl.write0
> >>   24  0.36% 72.92%    4589 305021 sun.nio.ch.FileDispatcherImpl.write0
> >>   25  0.35% 73.27%    4503 304992 sun.nio.ch.FileDispatcherImpl.write0
> >>   26  0.32% 73.60%    4198 305022 sun.nio.ch.FileDispatcherImpl.write0
> >>   27  0.25% 73.85%    3250 305246 sun.nio.ch.FileDispatcherImpl.write0
> >>   28  0.25% 74.10%    3249 305497 sun.nio.ch.FileDispatcherImpl.write0
> >>
> >>   Request channel - Actual CPU
> >>   29  0.22% 74.32%    2862 305000 sun.misc.Unsafe.unpark
> >>   30  0.17% 74.49%    2163 304838 sun.misc.Unsafe.unpark
> >>   31  0.14% 74.63%    1795 305240 sun.misc.Unsafe.unpark
> >>
> >>   Purgatory - Actual CPU
> >>   32  0.12% 74.75%    1553 305137 scala.collection.immutable.HashMap.$plus
> >>   33  0.12% 74.87%    1546 305100 java.util.concurrent.ConcurrentHashMap.get
> >>   34  0.12% 74.98%    1531 305181 java.util.concurrent.ConcurrentHashMap.get
> >>   35  0.12% 75.10%    1526 305234 scala.collection.immutable.HashMap.$plus
> >>   36  0.12% 75.22%    1521 305401 scala.collection.immutable.HashMap.$plus
> >>   37  0.12% 75.34%    1519 305186 java.util.concurrent.ConcurrentHashMap.get
> >>   38  0.12% 75.46%    1517 305264 java.util.concurrent.ConcurrentHashMap.get
> >>   39  0.12% 75.57%    1514 305271 java.util.concurrent.ConcurrentHashMap.get
> >>   40  0.12% 75.69%    1511 305250 scala.collection.immutable.HashMap.$plus
> >>   41  0.12% 75.81%    1499 305155 java.util.concurrent.ConcurrentHashMap.get
> >>   42  0.12% 75.92%    1496 305113 scala.collection.immutable.HashMap.$plus
> >>   43  0.12% 76.04%    1496 305263 scala.collection.immutable.HashMap.$plus
> >>   44  0.11% 76.15%    1480 305235 scala.collection.immutable.HashMap.$plus
> >>   45  0.11% 76.26%    1444 305185 scala.collection.immutable.HashMap.$plus
> >>   46  0.11% 76.37%    1428 305102 java.util.concurrent.ConcurrentHashMap.get
> >>   47  0.11% 76.48%    1418 305320 java.util.concurrent.ConcurrentHashMap.get
> >>
> >>   Compression - Actual CPU
> >>   48  0.09% 76.58%    1187 306546 org.xerial.snappy.SnappyNative.rawCompress
> >>   49  0.09% 76.67%    1174 305584 org.xerial.snappy.SnappyNative.rawCompress
> >>   50  0.09% 76.76%    1173 305545 org.xerial.snappy.SnappyNative.rawCompress
> >>   51  0.09% 76.85%    1165 305661 org.xerial.snappy.SnappyNative.rawCompress
> >>   52  0.09% 76.94%    1152 307069 org.xerial.snappy.SnappyNative.rawCompress
> >>   53  0.09% 77.02%    1121 305997 org.xerial.snappy.SnappyNative.rawCompress
> >>   54  0.09% 77.11%    1117 307035 org.xerial.snappy.SnappyNative.rawCompress
> >>   55  0.09% 77.19%    1106 306190 org.xerial.snappy.SnappyNative.rawCompress
> >>
> >> On Mon, Feb 2, 2015 at 9:39 AM, Jay Kreps <jay.kreps@gmail.com> wrote:
> >>
> >> > Ah, yeah, you're right. That is just wait time, not CPU time. We
> >> > should check that profile; it must be something else on the list.
> >> >
> >> > -Jay
> >> >
> >> > On Mon, Feb 2, 2015 at 9:33 AM, Jun Rao <jun@confluent.io> wrote:
> >> >
> >> >> Hi, Mathias,
> >> >>
> >> >> From the hprof output, it seems that the top CPU consumers are
> >> >> socketAccept() and epollWait(). As far as I am aware, there haven't
> >> >> been any significant changes in the socket server code between 0.8.1
> >> >> and 0.8.2. Could you run the same hprof test on 0.8.1 so that we can
> >> >> see the difference?
> >> >>
> >> >> Jaikiran,
> >> >>
> >> >> The fix you provided is probably unnecessary. The channel that we
> >> >> use in SimpleConsumer (BlockingChannel) is configured to be blocking.
> >> >> So even though the read from the socket is in a loop, each read
> >> >> blocks if there are no bytes received from the broker. So, that
> >> >> shouldn't cause extra CPU consumption.
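> >> >>
> >> >> The read loop boils down to roughly the following sketch
> >> >> (illustrative Java, not the actual Kafka code; the names are made
> >> >> up):
> >> >>
> >> >> import java.io.EOFException;
> >> >> import java.io.IOException;
> >> >> import java.nio.ByteBuffer;
> >> >> import java.nio.channels.ReadableByteChannel;
> >> >>
> >> >> class BlockingReadSketch {
> >> >>   // On a channel configured as blocking, read() parks the calling
> >> >>   // thread until bytes arrive, so looping over it does not busy-wait.
> >> >>   static void readCompletely(ReadableByteChannel channel, ByteBuffer buffer)
> >> >>       throws IOException {
> >> >>     while (buffer.hasRemaining()) {
> >> >>       int n = channel.read(buffer); // blocks; ~0% CPU while waiting
> >> >>       if (n < 0)
> >> >>         throw new EOFException("connection closed before full response");
> >> >>     }
> >> >>   }
> >> >> }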
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Jun
> >> >>
> >> >> On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> >> >> mathias.soederberg@gmail.com> wrote:
> >> >>
> >> >> > Hi Neha,
> >> >> >
> >> >> > I sent an e-mail earlier today, but noticed now that it didn't
> >> >> > actually go through.
> >> >> >
> >> >> > Anyhow, I've attached two files, one with output from a 10 minute
> >> >> > run and one with output from a 30 minute run. Realized that maybe
> >> >> > I should've done one or two runs with 0.8.1.1 as well, but
> >> >> > nevertheless.
> >> >> >
> >> >> > I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the
> >> >> > same CPU usage as with the beta version (basically pegging all
> >> >> > cores). If I manage to find the time I'll do another run with
> >> >> > hprof on the rc2 version later today.
> >> >> >
> >> >> > Best regards,
> >> >> > Mathias
> >> >> >
> >> >> > On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <neha@confluent.io>
> >> >> > wrote:
> >> >> >
> >> >> >> The following should be sufficient:
> >> >> >>
> >> >> >> java -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=kafka.hprof <classname>
> >> >> >>
> >> >> >> You would need to start the Kafka server with the settings above
> >> >> >> for some time until you observe the problem.
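> >> >> >>
> >> >> >> For the broker specifically, one way to pass the agent without
> >> >> >> editing the start scripts (assuming the stock kafka-run-class.sh,
> >> >> >> which appends KAFKA_OPTS to the java command line) is roughly:
> >> >> >>
> >> >> >> KAFKA_OPTS="-agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=kafka.hprof" \
> >> >> >>   bin/kafka-server-start.sh config/server.properties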
> >> >> >>
> >> >> >> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> >> >> >> mathias.soederberg@gmail.com> wrote:
> >> >> >>
> >> >> >> > Hi Neha,
> >> >> >> >
> >> >> >> > Yeah sure. I'm not familiar with hprof, so are there any
> >> >> >> > particular options I should include, or should I just run with
> >> >> >> > the defaults?
> >> >> >> >
> >> >> >> > Best regards,
> >> >> >> > Mathias
> >> >> >> >
> >> >> >> > On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <neha@confluent.io>
> >> >> >> > wrote:
> >> >> >> >
> >> >> >> > > Thanks for reporting the issue. Would you mind running hprof
> >> >> >> > > and sending the output?
> >> >> >> > >
> >> >> >> > > On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> >> >> >> > > mathias.soederberg@gmail.com> wrote:
> >> >> >> > >
> >> >> >> > > > Good day,
> >> >> >> > > >
> >> >> >> > > > I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and
> >> >> >> > > > noticed that the CPU usage on the broker machines went up by
> >> >> >> > > > roughly 40%, from ~60% to ~100%, and am wondering if anyone
> >> >> >> > > > else has experienced something similar? The load average also
> >> >> >> > > > went up by 2x-3x.
> >> >> >> > > >
> >> >> >> > > > We're running on EC2 and the cluster currently consists of
> >> >> >> > > > four m1.xlarge, with roughly 1100 topics / 4000 partitions.
> >> >> >> > > > Using Java 7 (1.7.0_65 to be exact) and Scala 2.9.2.
> >> >> >> > > > Configurations can be found over here:
> >> >> >> > > > https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> >> >> >> > > >
> >> >> >> > > > I'm assuming that this is not expected behaviour for
> >> >> >> > > > 0.8.2-beta?
> >> >> >> > > >
> >> >> >> > > > Best regards,
> >> >> >> > > > Mathias
> >> >> >> > > >
> >> >> >> > >
> >> >> >> > >
> >> >> >> > >
> >> >> >> > > --
> >> >> >> > > Thanks,
> >> >> >> > > Neha
> >> >> >> > >
> >> >> >> >
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Thanks,
> >> >> >> Neha
> >> >> >>
> >> >> >
> >> >>
> >> >
> >> >
> >>
> >
>
