kafka-users mailing list archives

From Greenhorn Techie <greenhorntec...@gmail.com>
Subject Re: Query on MirrorMaker Replication - Bi-directional/Failover replication
Date Thu, 19 Jan 2017 00:56:14 GMT
Hi there,

Can anyone please answer my follow-up questions below regarding Ewen's
responses?

Thanks

On Tue, 17 Jan 2017 at 00:28 Greenhorn Techie <greenhorntechie@gmail.com>
wrote:

> Thanks Ewen for the detailed response. It is quite helpful and has cleared
> up some of my doubts. However, I do have some follow-up queries. Could you
> please let me know your thoughts on them?
>
> [Query] Are non-compacted topics a prerequisite for this mechanism to work
> as expected? What challenges need to be looked out for in the case of
> compacted topics?
> 1. Use MM normally to replicate your data. Be *very* sure you construct
> your setup to ensure *everything* is mirrored (proper # of partitions,
> replication factor, topic level configs, etc). (Note that this is something
> the Confluent replication solution is addressing that's a significant
> gap in MM.)
> [Query] Here we are planning to rely on MirrorMaker to do the job for us,
> with topics expected to be created automatically on the destination (by
> setting auto.create.topics.enable=true). Will this work, or will
> auto.create.topics.enable=true create the topics with the broker's default
> settings?
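> If auto-creation only applies the broker defaults, I assume we would have
> to pre-create the destination topics ourselves with settings copied from
> the source. Is something like the sketch below the right idea? (This
> assumes the Java AdminClient is available to us; the bootstrap address,
> topic name, partition count, replication factor and configs are all
> placeholders.)
>
>     import org.apache.kafka.clients.admin.AdminClient;
>     import org.apache.kafka.clients.admin.NewTopic;
>     import java.util.Collections;
>     import java.util.Properties;
>
>     public class PreCreateTopic {
>         public static void main(String[] args) throws Exception {
>             Properties props = new Properties();
>             // Placeholder DR cluster brokers
>             props.put("bootstrap.servers", "dr-broker1:9092");
>
>             try (AdminClient admin = AdminClient.create(props)) {
>                 // Copy the source topic's partition count, replication
>                 // factor and topic-level configs instead of relying on
>                 // the destination broker's defaults.
>                 NewTopic topic = new NewTopic("my-topic", 12, (short) 3)
>                         .configs(Collections.singletonMap("cleanup.policy", "delete"));
>                 admin.createTopics(Collections.singletonList(topic)).all().get();
>             }
>         }
>     }
>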
> 2. During replication, be sure to record offset deltas for every
> topic partition. These are needed to reverse the direction of
> replication correctly. Make sure to store them in the backup DC and
> somewhere very reliable.
> [Query] Is there any recommended approach for doing this? As I am new to
> Kafka, I am wondering whether there is an established way of recording
> these deltas.
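> To make sure I understand what recording the deltas involves, is it
> essentially a periodic snapshot like the sketch below? (The bootstrap
> addresses and the topic/partition list are placeholders, and I realise the
> delta is only exact while MirrorMaker is fully caught up at the moment of
> the snapshot.)
>
>     import org.apache.kafka.clients.consumer.KafkaConsumer;
>     import org.apache.kafka.common.TopicPartition;
>     import java.util.*;
>
>     public class OffsetDeltaSnapshot {
>         static KafkaConsumer<byte[], byte[]> consumerFor(String bootstrap) {
>             Properties p = new Properties();
>             p.put("bootstrap.servers", bootstrap);
>             p.put("key.deserializer",
>                   "org.apache.kafka.common.serialization.ByteArrayDeserializer");
>             p.put("value.deserializer",
>                   "org.apache.kafka.common.serialization.ByteArrayDeserializer");
>             return new KafkaConsumer<>(p);
>         }
>
>         public static void main(String[] args) {
>             List<TopicPartition> partitions =
>                     Collections.singletonList(new TopicPartition("my-topic", 0));
>             try (KafkaConsumer<byte[], byte[]> src = consumerFor("primary-broker1:9092");
>                  KafkaConsumer<byte[], byte[]> dst = consumerFor("dr-broker1:9092")) {
>                 Map<TopicPartition, Long> srcEnd = src.endOffsets(partitions);
>                 Map<TopicPartition, Long> dstEnd = dst.endOffsets(partitions);
>                 for (TopicPartition tp : partitions) {
>                     // Source log-end offset minus mirrored log-end offset.
>                     // Persist this somewhere durable in the backup DC,
>                     // keyed by topic-partition.
>                     long delta = srcEnd.get(tp) - dstEnd.get(tp);
>                     System.out.printf("%s delta=%d%n", tp, delta);
>                 }
>             }
>         }
>     }
>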
> 4. Decide to do failover. Ensure replication has actually stopped (via
> your own tooling, or probably better, by using ACLs to ensure no new data
> can be
> produced from original DC to backup DC)
> [Query] Does stopping replication mean killing the MirrorMaker process, or
> is there more needed here? Using ACLs, we could presumably ensure the
> MirrorMaker service account no longer has read access on the source cluster
> or write access on the DR cluster. Is there anything else to be done here?
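> For the ACL part, is something like the sketch below the right idea,
> assuming the Java AdminClient's ACL API is available to us? (The principal,
> topic and bootstrap address are placeholders; I assume the equivalent can
> be done with the kafka-acls tool as well.)
>
>     import org.apache.kafka.clients.admin.AdminClient;
>     import org.apache.kafka.common.acl.AccessControlEntry;
>     import org.apache.kafka.common.acl.AclBinding;
>     import org.apache.kafka.common.acl.AclOperation;
>     import org.apache.kafka.common.acl.AclPermissionType;
>     import org.apache.kafka.common.resource.PatternType;
>     import org.apache.kafka.common.resource.ResourcePattern;
>     import org.apache.kafka.common.resource.ResourceType;
>     import java.util.Collections;
>     import java.util.Properties;
>
>     public class BlockMirrorMakerWrites {
>         public static void main(String[] args) throws Exception {
>             Properties props = new Properties();
>             // Placeholder DR cluster brokers
>             props.put("bootstrap.servers", "dr-broker1:9092");
>
>             try (AdminClient admin = AdminClient.create(props)) {
>                 // Deny the MirrorMaker service account (placeholder
>                 // principal) write access to the mirrored topic on the
>                 // DR cluster so no more replicated data can arrive.
>                 AclBinding deny = new AclBinding(
>                         new ResourcePattern(ResourceType.TOPIC, "my-topic",
>                                 PatternType.LITERAL),
>                         new AccessControlEntry("User:mirrormaker", "*",
>                                 AclOperation.WRITE, AclPermissionType.DENY));
>                 admin.createAcls(Collections.singletonList(deny)).all().get();
>             }
>         }
>     }
>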
> 5. Record all the high watermarks for every topic partition so you
> know which data was replicated from the original DC (vs which is new
> after failover).
> [Query] Is there any best practice around this? In the presentation, Jun
> Rao talks about timestamp-based offset recording. As I understand it, that
> would probably help our case, as we could produce messages to the DR
> cluster from the point of failover.
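> For my own understanding, I assume recording the high watermarks and the
> timestamp-based lookup would look roughly like the sketch below, run
> against the DR cluster at failover time. (The bootstrap address, topic and
> failover timestamp are placeholders.)
>
>     import org.apache.kafka.clients.consumer.KafkaConsumer;
>     import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
>     import org.apache.kafka.common.TopicPartition;
>     import java.util.*;
>
>     public class FailoverWatermarks {
>         public static void main(String[] args) {
>             Properties p = new Properties();
>             p.put("bootstrap.servers", "dr-broker1:9092"); // placeholder
>             p.put("key.deserializer",
>                   "org.apache.kafka.common.serialization.ByteArrayDeserializer");
>             p.put("value.deserializer",
>                   "org.apache.kafka.common.serialization.ByteArrayDeserializer");
>
>             List<TopicPartition> partitions =
>                     Collections.singletonList(new TopicPartition("my-topic", 0));
>             long failoverTimestampMs = System.currentTimeMillis(); // placeholder
>
>             try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(p)) {
>                 // High watermarks at failover: the last offsets that were
>                 // replicated from the original DC before new local
>                 // production starts on the DR cluster.
>                 Map<TopicPartition, Long> highWatermarks = consumer.endOffsets(partitions);
>
>                 // Timestamp-based lookup (as in Jun Rao's talk): the first
>                 // offset with a timestamp at or after the failover time.
>                 Map<TopicPartition, Long> ts = new HashMap<>();
>                 for (TopicPartition tp : partitions) ts.put(tp, failoverTimestampMs);
>                 Map<TopicPartition, OffsetAndTimestamp> byTime = consumer.offsetsForTimes(ts);
>
>                 System.out.println(highWatermarks);
>                 System.out.println(byTime);
>             }
>         }
>     }
>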
> 7. Once the original DC is back alive, you want to reverse replication
> and make it the backup. Look up the offset deltas, use them to
> initialize offsets for the consumer group you'll use to do replication.
> [Query] In order to look up the offset deltas before initiating the
> consumers on the original cluster, is there any recommended
> mechanism/tooling that can be leveraged?
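> Is the idea roughly that, before starting MirrorMaker in the reverse
> direction, we would seed the committed offsets of its consumer group from
> the recorded deltas/watermarks, something like the sketch below? (The
> group id, topic, bootstrap address and the computed start offset are
> placeholders.)
>
>     import org.apache.kafka.clients.consumer.KafkaConsumer;
>     import org.apache.kafka.clients.consumer.OffsetAndMetadata;
>     import org.apache.kafka.common.TopicPartition;
>     import java.util.Collections;
>     import java.util.Properties;
>
>     public class SeedReplicationOffsets {
>         public static void main(String[] args) {
>             Properties p = new Properties();
>             // The DR cluster (new primary) is the source for reverse replication.
>             p.put("bootstrap.servers", "dr-broker1:9092");      // placeholder
>             p.put("group.id", "reverse-mirrormaker");           // MM consumer group
>             p.put("enable.auto.commit", "false");
>             p.put("key.deserializer",
>                   "org.apache.kafka.common.serialization.ByteArrayDeserializer");
>             p.put("value.deserializer",
>                   "org.apache.kafka.common.serialization.ByteArrayDeserializer");
>
>             TopicPartition tp = new TopicPartition("my-topic", 0);
>             // Start offset derived from the recorded delta / high watermark
>             // for this partition (placeholder value).
>             long startOffset = 42000L;
>
>             try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(p)) {
>                 consumer.assign(Collections.singletonList(tp));
>                 consumer.commitSync(Collections.singletonMap(tp,
>                         new OffsetAndMetadata(startOffset)));
>             }
>             // MirrorMaker started with this group.id will then resume from
>             // the committed offsets instead of earliest/latest.
>         }
>     }
>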
>
> Best Regards
>
> On Fri, 6 Jan 2017 at 03:31 Ewen Cheslack-Postava <ewen@confluent.io>
> wrote:
>
> On Thu, Jan 5, 2017 at 3:07 AM, Greenhorn Techie <
> greenhorntechie@gmail.com>
> wrote:
>
> > Hi,
> >
> > We are planning to set up MirrorMaker-based Kafka replication for DR
> > purposes. The base requirement is to have DR replication from the primary
> > site (site1) to the DR site (site2) using MirrorMaker.
> >
> > However, we need the solution to work in the case of failover as well,
> > i.e. in the event of the site1 Kafka cluster failing, the site2 Kafka
> > cluster would be made primary. Later, when the site1 cluster eventually
> > comes back online, the direction of replication would be from site2 to
> > site1.
> >
> > But as I understand it, the offsets on each of the clusters are
> > different, so I am wondering how to design the solution given this
> > constraint and these requirements.
> >
>
> It turns out this is tricky. And once you start digging in you'll find it's
> way more complicated than you might originally think.
>
> Before going down the rabbit hole, I'd suggest taking a look at this great
> talk by Jun Rao (one of the original authors of Kafka) about multi-DC Kafka
> setups: https://www.youtube.com/watch?v=Dvk0cwqGgws
>
> Additionally, I want to mention that while it is tempting to treat
> multi-DC DR cases in a way that gives us really convenient, strongly
> consistent, highly available behavior, because that makes them easier to
> reason about and avoids pushing much of the burden down to applications,
> that's not realistic or practical. And honestly, it's rarely even
> necessary. DR cases really are DR. Usually it is possible to make some
> tradeoffs you might not make under normal circumstances (the most important
> one being the tradeoff between possibly seeing duplicates vs exactly once).
> The tension here is often that one team is responsible for maintaining the
> infrastructure and handling this DR failover scenario, and others are
> responsible for the behavior of the applications. The infrastructure team
> is responsible for figuring out the DR failover story but if they don't
> solve it at the infrastructure layer then they get stuck having to
> understand all the current (and future) applications built on that
> infrastructure.
>
> That said, here are the details I think you're looking for:
>
> The short answer right now is that doing DR failover like that is not going
> to be easy with MM. Confluent is building additional tools to deal with
> multi-DC setups because of a bunch of these challenges:
> https://www.confluent.io/product/multi-datacenter/
>
> For your specific concern about reversing the direction of replication,
> you'd need to build additional tooling to support this. The basic list of
> steps would be something like this (assuming non-compacted topics):
>
> 1. Use MM normally to replicate your data. Be *very* sure you construct
> your setup to ensure *everything* is mirrored (proper # of partitions,
> replication factor, topic level configs, etc). (Note that this is something
> the Confluent replication solution is addressing that's a significant gap
> in MM.)
> 2. During replication, be sure to record offset deltas for every topic
> partition. These are needed to reverse the direction of replication
> correctly. Make sure to store them in the backup DC and somewhere very
> reliable.
> 3. Observe DC failure.
> 4. Decide to do failover. Ensure replication has actually stopped (via your
> own tooling, or probably better, by using ACLs to ensure no new data can be
> produced from original DC to backup DC)
> 5. Record all the high watermarks for every topic partition so you know
> which data was replicated from the original DC (vs which is new after
> failover).
> 6. Allow failover to proceed. Make the backup DC primary.
> 7. Once the original DC is back alive, you want to reverse replication and
> make it the backup. Look up the offset deltas, use them to initialize
> offsets for the consumer group you'll use to do replication.
> 8. Go back to the original DC and make sure there isn't any "extra" data,
> i.e. stuff that didn't get replicated but was successfully written to the
> original DC's cluster. For topic partitions where there is data beyond the
> expected offsets, you currently would need to just delete the entire set of
> data, or at least to before the offset we expect to start at. (A truncate
> operation might be a nice way to avoid having to dump *all* the data, but
> doesn't currently exist.)
> 9. Once you've got the two clusters back in a reasonably synced state with
> appropriate starting offsets committed, start up MM again in the reverse
> direction.
>
> If this sounds tricky, it turns out that when you add compacted topics,
> things get quite a bit messier....
>
> -Ewen
>
>
> >
> > Thanks
> >
>
>
