kafka-users mailing list archives

From Charity Majors <char...@hound.sh>
Subject Re: kafka + autoscaling groups fuckery
Date Sun, 03 Jul 2016 18:35:52 GMT
It would be cool to know if Netflix runs kafka in ASGs ... I can't find any
mention of it online.  (https://github.com/Netflix/suro/wiki/FAQ sorta
implies that maybe they do, but it's not clear, and also old.)

I've seen other people talking about running kafka in ASGs, e.g.
http://blog.jimjh.com/building-elastic-clusters.html, but they all rely on
reusing broker IDs.  Which certainly makes it easier, but imho is the wrong
way to do this for all the reasons I listed before.
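For context, the fresh-broker-ID approach boils down to generating a reassignment plan over whatever broker IDs are currently alive and feeding it to Kafka's kafka-reassign-partitions.sh tool. A minimal sketch of building such a plan (topic name, partition count, and broker IDs below are made up for illustration):

```python
import json
from itertools import cycle

def reassignment_plan(topic, num_partitions, broker_ids, replication_factor):
    """Build the JSON plan kafka-reassign-partitions.sh consumes, placing
    replicas round-robin over the current (possibly brand-new) broker IDs."""
    starts = cycle(range(len(broker_ids)))
    partitions = []
    for p in range(num_partitions):
        first = next(starts)
        replicas = [broker_ids[(first + i) % len(broker_ids)]
                    for i in range(replication_factor)]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}

# Hypothetical topic and freshly auto-generated broker IDs:
plan = reassignment_plan("events", 6, [2001, 2002, 2003], 2)
print(json.dumps(plan, indent=2))
```

The resulting JSON would then be handed to the reassignment tool with --reassignment-json-file and --execute, and checked with --verify.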

On Sun, Jul 3, 2016 at 11:29 AM, Charity Majors <charity@hound.sh> wrote:

> Great talks, but not relevant to either of my problems -- the golang
> client not rebalancing the consumer offset topic, or autoscaling group
> behavior (which I think is probably just a consequence of the first).
>
> Thanks though, there's good stuff in here.
>
> On Sun, Jul 3, 2016 at 10:23 AM, James Cheng <wushujames@gmail.com> wrote:
>
>> Charity,
>>
>> I'm not sure about the specific problem you are having, but about Kafka
>> on AWS, Netflix did a talk at a meetup about their Kafka installation on
>> AWS. There might be some useful information in there. There is a video
>> stream as well as slides, and maybe you can get in touch with the speakers.
>> Look in the comment section for links to the slides and video.
>>
>> Kafka at Netflix
>>
>> http://www.meetup.com//http-kafka-apache-org/events/220355031/?showDescription=true
>>
>> There's also a talk about running Kafka on Mesos, which might be relevant.
>>
>> Kafka on Mesos
>>
>> http://www.meetup.com//http-kafka-apache-org/events/222537743/?showDescription=true
>>
>> -James
>>
>> Sent from my iPhone
>>
>> > On Jul 2, 2016, at 5:15 PM, Charity Majors <charity@hound.sh> wrote:
>> >
>> > Gwen, thanks for the response.
>> >
>> >> 1.1 Your life may be a bit simpler if you have a way of starting a new
>> >> broker with the same ID as the old one - this means it will
>> >> automatically pick up the old replicas and you won't need to
>> >> rebalance. Makes life slightly easier in some cases.
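As an aside on 1.1: on 0.9 a broker ID can either be pinned in config or auto-generated, which is likely where the 1002-style IDs in this thread come from. A rough server.properties sketch (values illustrative):

```properties
# Pinning the ID: a replacement broker with the same ID adopts the
# old broker's replicas on startup, no reassignment needed.
broker.id=1001

# Or let Kafka assign IDs (0.9+ default): each new broker gets a fresh
# ID above reserved.broker.max.id (default 1000).
#broker.id.generation.enable=true
#reserved.broker.max.id=1000
```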
>> >
>> > Yeah, this is definitely doable, I just don't *want* to do it.  I really
>> > want all of these to share the same code path: 1) rolling all nodes in an
>> > ASG to pick up a new AMI, 2) hardware failure / unintentional node
>> > termination, 3) resizing the ASG and rebalancing the data across nodes.
>> >
>> > Everything but the first one means generating new node IDs, so I would
>> > rather just do that across the board.  It's the solution that really fits
>> > the ASG model best, so I'm reluctant to give up on it.
>> >
>> >
>> >> 1.2 Careful not to rebalance too many partitions at once - you only
>> >> have so much bandwidth and currently Kafka will not throttle
>> >> rebalancing traffic.
>> >
>> > Nod, got it.  This is def something I plan to work on hardening once I have
>> > the basic nut of things working (or if I've had to give up on it and accept
>> > a lesser solution).
>> >
>> >
>> >> 2. I think your rebalance script is not rebalancing the offsets topic?
>> >> It still has a replica on broker 1002. You have two good replicas, so
>> >> you are nowhere near disaster, but make sure you get this working
>> >> too.
>> >
>> > Yes, this is another problem I am working on in parallel.  The Shopify
>> > sarama library <https://godoc.org/github.com/Shopify/sarama> uses the
>> > __consumer_offsets topic, but it does *not* let you rebalance or resize the
>> > topic when consumers connect, disconnect, or restart.
>> >
>> > "Note that Sarama's Consumer implementation does not currently support
>> > automatic consumer-group rebalancing and offset tracking"
>> >
>> > I'm working on trying to get the sarama-cluster to do something here.  I
>> > think these problems are likely related, I'm not sure wtf you are
>> > *supposed* to do to rebalance this god damn topic.  It also seems like we
>> > aren't using a consumer group which sarama-cluster depends on to rebalance
>> > a topic.  I'm still pretty confused by the 0.9 "consumer group" stuff.
>> >
>> > Seriously considering downgrading to the latest 0.8 release, because
>> > there's a massive gap in documentation for the new stuff in 0.9 (like
>> > consumer groups) and we don't really need any of the new features.
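For what it's worth, the group coordinator for a consumer group is chosen by hashing the group id onto a partition of __consumer_offsets (50 partitions by default, per offsets.topic.num.partitions). A sketch of that mapping, assuming Kafka uses Java's String.hashCode with the sign bit masked off:

```python
def java_string_hash(s):
    """Java's String.hashCode(): h = 31*h + c, as a signed 32-bit int."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def offsets_partition(group_id, num_partitions=50):
    """Which __consumer_offsets partition owns this group's offsets
    (and hence which broker leads it is that group's coordinator)."""
    # Kafka masks off the sign bit rather than calling abs()
    return (java_string_hash(group_id) & 0x7FFFFFFF) % num_partitions
```

So a group's committed offsets all live in one partition of __consumer_offsets, which is why a stray under-replicated partition there can bite a single group.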
>> >
>> >> A common work-around is to configure the consumer to handle "offset
>> >> out of range" exception by jumping to the last offset available in the
>> >> log. This is the behavior of the Java client, and it would have saved
>> >> your consumer here. Go client looks very low level, so I don't know
>> >> how easy it is to do that.
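The behavior Gwen describes amounts to a tiny offset resolver (function and parameter names below are mine); the Java client's auto.offset.reset setting picks which end of the log to jump to when the stored offset is out of range:

```python
def resolve_offset(requested, log_start, log_end, reset="latest"):
    """If the stored offset is still in the log, use it; otherwise
    jump to one end of the log instead of crashing the consumer."""
    if log_start <= requested <= log_end:
        return requested
    # "latest" skips ahead (may drop messages); "earliest" replays
    # from the start of the retained log (may reprocess messages).
    return log_end if reset == "latest" else log_start
```

I believe sarama exposes a similar knob for its starting position (Consumer.Offsets.Initial, OffsetNewest vs OffsetOldest), though wiring it into out-of-range handling is left to the caller.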
>> >
>> > Erf, this seems like it would almost guarantee data loss.  :(  Will check
>> > it out tho.
>> >
>> >> If I were you, I'd retest your ASG scripts without the auto leader
>> >> election - since your own scripts can / should handle that.
>> >
>> > Okay, this is straightforward enough.  Will try it.  And will keep trying
>> > to figure out how to balance the __consumer_offsets topic, since I
>> > increasingly think that's the key to this giant mess.
>> >
>> > If anyone has any advice there, massively appreciated.
>> >
>> > Thanks,
>> >
>> > charity.
>>
>
>
