kafka-users mailing list archives

From Ewen Cheslack-Postava <e...@confluent.io>
Subject Re: If you run Kafka in AWS or Docker, how do you persist data?
Date Sun, 01 Mar 2015 21:08:45 GMT
On Fri, Feb 27, 2015 at 8:09 PM, Jeff Schroeder <jeffschroeder@computer.org>
wrote:

> Kafka on dedicated hosts running in docker under marathon under Mesos. It
> was a real bear to get working, but is really beautiful once I did manage
> to get it working. I simply run with a unique hostname constraint and
> number of instances = replication factor. If a broker dies and it isn't a
> hardware or network issue, marathon restarts it.
> The hardest part was that Kafka was registering to ZK with the internal (to
> docker) port. My workaround was that you have to use the same port inside
> and outside docker or it will register to ZK with whatever the port is
> inside the container.

You should be able to use advertised.host.name and advertised.port to
control this, so you aren't required to use the same port inside and
outside Docker.
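
As a rough illustration, the small Python helper below renders the relevant
server.properties lines for a broker whose container port is mapped to a
different port on the host. The function name, the example hostname, and the
port numbers are made up for this sketch; only the property names (port,
advertised.host.name, advertised.port) are the broker settings referred to
above. Later Kafka releases replaced these two settings with
advertised.listeners, so check the docs for the version you run.

```python
def advertised_overrides(external_host, external_port, internal_port=9092):
    """Render server.properties lines for a broker behind a Docker port mapping.

    Hypothetical helper: the names and defaults are assumptions for
    illustration, not part of Kafka itself.
    """
    return [
        # Port the broker listens on inside the container.
        f"port={internal_port}",
        # Host and port the broker registers in ZooKeeper, which clients
        # then use to connect from outside the container.
        f"advertised.host.name={external_host}",
        f"advertised.port={external_port}",
    ]


if __name__ == "__main__":
    # Example values only: host port 31092 mapped to container port 9092.
    print("\n".join(advertised_overrides("broker1.example.com", 31092)))
```

With something like this, the port inside the container can stay at 9092
while each broker advertises whatever host port the scheduler assigned.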

> FYI this is an on premise dedicated Mesos cluster running on bare metal :)
> On Friday, February 27, 2015, James Cheng <jcheng@tivo.com> wrote:
> > Hi,
> >
> > I know that Netflix might be talking about "Kafka on AWS" at the March
> > meetup, but I wanted to bring up the topic anyway.
> >
> > I'm sure that some people are running Kafka in AWS. Is anyone running
> > Kafka within docker in production? How does that work?
> >
> > For both of these, how do you persist data? If on AWS, do you use EBS? Do
> > you use ephemeral storage and then rely on replication? And if using
> > docker, do you persist data outside the docker container and on the host
> > machine?

On AWS, your choice will depend on a tradeoff among tolerance for data loss,
performance, and price sensitivity. You might be able to get better/more
predictable performance out of the ephemeral instance storage, but since
you are presumably running all instances in the same AZ, you leave yourself
open to significant data loss if there's a coordinated outage. It's pretty
rare, but it does happen. With EBS you may have to do more work or spread
across more volumes to get the same throughput. Relevant quote from the
docs on provisioned IOPS: "Additionally, you can stripe multiple volumes
together to achieve up to 48,000 IOPS or 800MBps when attached to larger
EC2 instances". (Note MBps, not Mbps.) Other considerations: AWS has been
moving most of its instance storage to SSDs, so getting enough instance
storage space can be relatively pricey. You can also potentially go with a
hybrid setup to get a balance of the two, but then you'll need to be very
careful about partition assignment to ensure at least one copy of every
partition ends up on an EBS-backed node.
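
To make the hybrid-setup caveat concrete, here is a minimal sketch of a check
you could run against a replica assignment. It assumes you already have the
assignment as a dict of (topic, partition) to broker ids and know which broker
ids are EBS-backed; the function name and the data shapes are hypothetical,
and nothing here talks to Kafka or AWS.

```python
def partitions_without_ebs_replica(assignment, ebs_brokers):
    """Return the partitions that have no replica on an EBS-backed broker.

    assignment:  {(topic, partition): [broker_id, ...]}  (assumed shape)
    ebs_brokers: iterable of broker ids whose log dirs live on EBS
    """
    ebs = set(ebs_brokers)
    return sorted(
        tp for tp, replicas in assignment.items()
        # A partition is at risk if none of its replicas is on an EBS broker.
        if not ebs.intersection(replicas)
    )


if __name__ == "__main__":
    # Example data only: brokers 1 and 2 use instance storage, 3 uses EBS.
    assignment = {
        ("events", 0): [1, 3],
        ("events", 1): [1, 2],
        ("events", 2): [2, 3],
    }
    print(partitions_without_ebs_replica(assignment, ebs_brokers=[3]))
```

Any partition this returns would live only on ephemeral storage and would
need to be reassigned before you could rely on the EBS copies.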

For Docker, you probably want the data to be stored on a volume. If
possible, it would be better if non-hardware errors could be resolved just
by restarting the broker. You'll avoid a lot of needless copying of data.
Storing data in a volume would let you simply start a new container and
have it pick up where the last one left off. The example of Postgres given
for a volume container in https://docs.docker.com/userguide/dockervolumes/
isn't too far from Kafka if you were to assume Postgres was replicating to
a slave -- you'd prefer to reuse the existing data on the existing node
(which a volume container enables), but could still handle bringing up a
new node if necessary.
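
One way to picture the volume approach is a bind mount of a host directory
into the container, sketched below as a Python function that only builds the
docker run argument list (it does not execute anything). The image name,
container name, and paths are placeholders I made up; -d, --name, -v and -p
are standard docker run flags. The broker's log.dirs setting would need to
point at the container-side path for this to have any effect.

```python
def docker_run_args(host_data_dir, container_data_dir="/kafka-data",
                    port=9092, image="example/kafka", name="kafka-broker"):
    """Build a `docker run` command that keeps Kafka data on the host.

    All defaults are hypothetical placeholders for illustration.
    """
    return [
        "docker", "run", "-d",
        "--name", name,
        # Bind-mount the host directory so the log segments outlive the
        # container; a replacement container can reuse the same data.
        "-v", f"{host_data_dir}:{container_data_dir}",
        # Same port inside and outside, as in the setup described above.
        "-p", f"{port}:{port}",
        image,
    ]


if __name__ == "__main__":
    print(" ".join(docker_run_args("/mnt/kafka/broker1")))
```

If the container dies for a non-hardware reason, starting a new one with the
same arguments lets the broker recover from its existing logs instead of
re-replicating everything.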

> >
> > And related, how do you deal with broker failure? Do you simply replace
> > it, and repopulate a new broker via replication? Or do you bring back up
> > the broker with the persisted files?
> >
> > Trying to learn about what people are doing, beyond "on premises and
> > dedicated hardware".
> >
> > Thanks,
> > -James
> >
> >
> --
> Text by Jeff, typos by iPhone

