kafka-users mailing list archives

From Carl Lerche ...@carllerche.com>
Subject Re: Surprisingly high network traffic between kafka servers
Date Fri, 07 Feb 2014 21:29:10 GMT
Hey Joe,

Those periods with "no traffic" actually are periods of expected
traffic between nodes. It's just that the traffic during the bad
periods is so high that the normal traffic is not visible on the
chart. Also, once traffic goes crazy, the only way to reset it is to
stop all kafka nodes at once (a rolling restart does not help).

I have been running the kafka nodes in different AWS AZs so the
bandwidth is costing me. For now, I have temporarily moved to a single
Kafka node. Once I can start collecting metrics, I will attempt to
reproduce the issue.

On Fri, Feb 7, 2014 at 5:42 AM, Joe Stein <joe.stein@stealth.ly> wrote:
> Carl, looking at the boundary chart it looks like you have periods of no
> traffic also... prior to the spikes.
>
> I also noticed you are using AWS from your logs, what instance types are
> you using?  Do you have any network checks in place?
>
> The logs show underReplication=true, which points towards what Joel
> was theorizing as the issue.
>
> Do you track stats on the cluster?
> See http://kafka.apache.org/documentation.html#monitoring. I would
> expect changes in the kafka stats to correlate with the boundary chart.
>
> /*******************************************
>  Joe Stein
>  Founder, Principal Consultant
>  Big Data Open Source Security LLC
>  http://www.stealth.ly
>  Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
> ********************************************/
>
>
> On Fri, Feb 7, 2014 at 2:47 AM, Carl Lerche <me@carllerche.com> wrote:
>
>> One last thing: I have collected a snippet of the network traffic
>> between Kafka instances using tcpdump. However, it contains some
>> customer data, and less than a minute's worth was over 1 GB, so I can't
>> really post it here, but I could possibly share it offline if it can
>> help debug the issue.
>>
>> On Thu, Feb 6, 2014 at 11:44 PM, Carl Lerche <me@carllerche.com> wrote:
>> > Re:
>> >
>> >> Could you also check if the on-disk data size/rate match the network
>> >> traffic?
>> >
>> > While I have not explicitly checked this, I would say that the answer
>> > is no. The network is over 1Gbps, I have set up monitoring for disk
>> > space, and nothing out of the norm is happening there. The expected
>> > data rate is on the order of 500 kbits per sec.
>> >
>> > cheers.
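[For a rough sense of the scale of the mismatch described above, here is a back-of-envelope sketch using the figures quoted in this thread (80 MB/s observed per server vs. ~500 kbit/s expected); the numbers are illustrative only:]

```shell
# Compare observed inter-broker replication traffic against the expected
# produce rate, using figures quoted elsewhere in this thread.
observed_mbit=$((80 * 8))                       # ~80 MB/s per server ≈ 640 Mbit/s
expected_kbit=500                               # ~500 kbit/s of expected produced data
ratio=$((observed_mbit * 1000 / expected_kbit)) # how many times over budget
echo "replication traffic is ~${ratio}x the expected produce rate"
```

A three-orders-of-magnitude gap like this points away from follower-fetch polling overhead and towards data actually being (re)transferred.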
>> >
>> > On Thu, Feb 6, 2014 at 9:06 PM, Jun Rao <junrao@gmail.com> wrote:
>> >> Could you also check if the on-disk data size/rate match the network
>> >> traffic?
>> >>
>> >> Thanks,
>> >>
>> >> Jun
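[Jun's check can be approximated by sampling the size of the Kafka data directory at an interval; a minimal sketch, where the path is a stand-in for your configured log.dirs:]

```shell
# Sample the on-disk size of a Kafka log directory twice and report the
# growth rate, to compare against the observed network traffic.
# NOTE: /tmp/kafka-logs-demo is a hypothetical stand-in for log.dirs.
dir=/tmp/kafka-logs-demo
mkdir -p "$dir"
s1=$(du -sk "$dir" | cut -f1)   # size in KB, first sample
sleep 2
s2=$(du -sk "$dir" | cut -f1)   # size in KB, second sample
echo "on-disk growth: $(( (s2 - s1) / 2 )) KB/s"
```

If the network moves tens of MB/s but the log directories barely grow, the traffic is likely repeated re-fetching rather than new data being persisted.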
>> >>
>> >>
>> >> On Thu, Feb 6, 2014 at 7:48 PM, Carl Lerche <me@carllerche.com> wrote:
>> >>
>> >>> So, the "good news" is that the problem came back again. The bad news
>> >>> is that I disabled debug logs as they were filling the disk (and I had
>> >>> other fires to put out). I will re-enable debug logs and wait for it
>> >>> to happen again.
>> >>>
>> >>> On Thu, Feb 6, 2014 at 4:05 AM, Neha Narkhede <neha.narkhede@gmail.com>
>> >>> wrote:
>> >>> > Carl,
>> >>> >
>> >>> > It will help if you can list the steps to reproduce this issue
>> >>> > starting from a fresh installation. Your setup, the way it stands,
>> >>> > seems to have gone through some config and state changes.
>> >>> >
>> >>> > Thanks,
>> >>> > Neha
>> >>> >
>> >>> >
>> >>> > On Wed, Feb 5, 2014 at 5:17 PM, Joel Koshy <jjkoshy.w@gmail.com>
>> >>> > wrote:
>> >>> >
>> >>> >> On Wed, Feb 05, 2014 at 04:51:16PM -0800, Carl Lerche wrote:
>> >>> >> > So, I tried enabling debug logging, and I also made some tweaks
>> >>> >> > to the config (which I probably shouldn't have), and craziness
>> >>> >> > happened.
>> >>> >> >
>> >>> >> > First, some more context. Besides the very high network traffic,
>> >>> >> > we were seeing some other issues that we were not focusing on yet.
>> >>> >> >
>> >>> >> > * Even though the log retention was set to 50GB & 24 hours, data
>> >>> >> > logs were getting cleaned up far quicker. I'm not entirely sure
>> >>> >> > how much quicker, but there was definitely far less than 12 hours
>> >>> >> > and 1GB of data.
>> >>> >> >
>> >>> >> > * Kafka was not properly balanced. We had 3 servers, and only 2
>> >>> >> > of them were partition leaders. One server was a replica for all
>> >>> >> > partitions. We tried to run a rebalance command, but it did not
>> >>> >> > work. We were going to investigate later.
>> >>> >>
>> >>> >> Were any of the brokers down for an extended period? If the
>> >>> >> preferred replica election command failed, it could be because the
>> >>> >> preferred replica was catching up (which could explain the higher
>> >>> >> than expected network traffic). Do you monitor the under-replicated
>> >>> >> partitions count on your cluster? If you have that data, it could
>> >>> >> help confirm this.
>> >>> >>
>> >>> >> Joel
>> >>> >>
>> >>> >> >
>> >>> >> > So, after restarting all the kafkas, something happened with the
>> >>> >> > offsets. The offsets that our consumers had no longer existed. It
>> >>> >> > looks like somehow all the contents were lost? The logs show many
>> >>> >> > exceptions like:
>> >>> >> >
>> >>> >> > `Request for offset 770354 but we only have log segments in the
>> >>> >> > range 759234 to 759838.`
>> >>> >> >
>> >>> >> > So, I reset all the consumer offsets to the head of the queue as
>> >>> >> > I did not know of anything better to do. Once the dust settled,
>> >>> >> > all the issues we were seeing vanished. Communication between
>> >>> >> > Kafka nodes appears to be normal, Kafka was able to rebalance,
>> >>> >> > and hopefully log retention will be normal.
>> >>> >> >
>> >>> >> > I am unsure what happened or how to get more debug information.
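[For reference, the "Request for offset ... but we only have log segments" situation above is what the 0.8-era consumer's auto.offset.reset property governs; a config sketch, not Carl's actual settings:]

```
# consumer.properties (0.8-era high-level consumer)
# What to do when there is no valid stored offset, or the stored offset
# falls outside the broker's available log segment range:
#   smallest = reset to the earliest available offset (head of the log)
#   largest  = reset to the latest offset (tail of the log)
auto.offset.reset=smallest
```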
>> >>> >> >
>> >>> >> > On Wed, Feb 5, 2014 at 12:31 PM, Jay Kreps <jay.kreps@gmail.com>
>> >>> >> > wrote:
>> >>> >> > > Can you enable DEBUG logging in log4j and see what requests
>> >>> >> > > are coming in?
>> >>> >> > >
>> >>> >> > > -Jay
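[For reference, request-level DEBUG logging on a 0.8 broker is driven by log4j.properties; a sketch along these lines, with logger names assumed from the 0.8-era defaults (verify against your own log4j.properties):]

```
# log4j.properties (broker) -- turn up request logging
# kafka.request.logger matches the 0.8-era default config; check your
# own file before relying on these names.
log4j.logger.kafka=DEBUG
log4j.logger.kafka.request.logger=TRACE, requestAppender
log4j.additivity.kafka.request.logger=false
```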
>> >>> >> > >
>> >>> >> > >
>> >>> >> > > On Tue, Feb 4, 2014 at 9:51 PM, Carl Lerche <me@carllerche.com>
>> >>> >> > > wrote:
>> >>> >> > >
>> >>> >> > >> Hi Jay,
>> >>> >> > >>
>> >>> >> > >> I do not believe that I have changed the
>> >>> >> > >> replica.fetch.wait.max.ms setting. Here I have included the
>> >>> >> > >> kafka config as well as a snapshot of jnettop from one of the
>> >>> >> > >> servers.
>> >>> >> > >>
>> >>> >> > >> https://gist.github.com/carllerche/4f2cf0f0f6d1e891f482
>> >>> >> > >>
>> >>> >> > >> The bottom row (89.9K/s) is the producer (it lives on a Kafka
>> >>> >> > >> server). The top two rows are Kafkas on other servers; you can
>> >>> >> > >> see the combined throughput is ~80MB/s.
>> >>> >> > >>
>> >>> >> > >> On Tue, Feb 4, 2014 at 9:36 PM, Jay Kreps <jay.kreps@gmail.com>
>> >>> >> > >> wrote:
>> >>> >> > >> > No this is not normal.
>> >>> >> > >> >
>> >>> >> > >> > Checking twice a second (using the 500ms default) for new
>> >>> >> > >> > data shouldn't cause high network traffic (that should be
>> >>> >> > >> > like < 1KB of overhead). I don't think that explains things.
>> >>> >> > >> > Is it possible that setting has been overridden?
>> >>> >> > >> >
>> >>> >> > >> > -Jay
>> >>> >> > >> >
>> >>> >> > >> >
>> >>> >> > >> > On Tue, Feb 4, 2014 at 9:25 PM, Guozhang Wang <wangguoz@gmail.com>
>> >>> >> > >> > wrote:
>> >>> >> > >> >
>> >>> >> > >> >> Hi Carl,
>> >>> >> > >> >>
>> >>> >> > >> >> For each partition the follower will also fetch data from
>> >>> >> > >> >> the leader replica, even if there is no new data in the
>> >>> >> > >> >> leader replica.
>> >>> >> > >> >>
>> >>> >> > >> >> One thing you can try is to increase replica.fetch.wait.max.ms
>> >>> >> > >> >> (default value 500ms) so that the followers' fetch request
>> >>> >> > >> >> frequency to the leader can be reduced, and see if that has
>> >>> >> > >> >> some effect on the traffic.
>> >>> >> > >> >>
>> >>> >> > >> >> Guozhang
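[The property Guozhang refers to is a broker-side setting; for reference, a server.properties sketch, where the 5000ms value is only an example to make the effect visible:]

```
# server.properties (each broker)
# Maximum time a follower's fetch request waits at the leader for new
# data; raising it from the 500ms default reduces fetch-request
# frequency when partitions are idle. Example value only -- tune to taste.
replica.fetch.wait.max.ms=5000
```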
>> >>> >> > >> >>
>> >>> >> > >> >>
>> >>> >> > >> >> On Tue, Feb 4, 2014 at 8:46 PM, Carl Lerche <me@carllerche.com>
>> >>> >> > >> >> wrote:
>> >>> >> > >> >>
>> >>> >> > >> >> > Hello,
>> >>> >> > >> >> >
>> >>> >> > >> >> > I'm running a 0.8.0 Kafka cluster of 3 servers. The
>> >>> >> > >> >> > service that it is for is not in full production yet,
>> >>> >> > >> >> > so the data written to the cluster is minimal (seems to
>> >>> >> > >> >> > average between 100kb/s -> 300kb/s per server). I have
>> >>> >> > >> >> > configured Kafka to have 3 replicas. I am noticing that
>> >>> >> > >> >> > each Kafka server is talking to all the others at a data
>> >>> >> > >> >> > rate of 40MB/s for each server (so, a total of 80MB/s
>> >>> >> > >> >> > for each server). This communication is constant.
>> >>> >> > >> >> >
>> >>> >> > >> >> > Is this normal? This seems like very strange behavior
>> >>> >> > >> >> > and I'm not exactly sure how to debug.
>> >>> >> > >> >> >
>> >>> >> > >> >> > Thanks,
>> >>> >> > >> >> > Carl
>> >>> >> > >> >> >
>> >>> >> > >> >>
>> >>> >> > >> >>
>> >>> >> > >> >>
>> >>> >> > >> >> --
>> >>> >> > >> >> -- Guozhang
>> >>> >> > >> >>
>> >>> >> > >>
>> >>> >>
>> >>> >>
>> >>>
>>
