kafka-users mailing list archives

From Jun Rao <jun...@gmail.com>
Subject Re: Kafka questions
Date Tue, 19 Jul 2011 04:37:47 GMT
Hi, Paul,

We'd love to get help from you if Kafka fits your needs. See my inline
replies below. It seems that you care about latency and reliability more
than throughput. Are you dealing with high-volume events here?

Jun

On Mon, Jul 18, 2011 at 8:27 PM, Paul Sutter <psutter@quantbench.com> wrote:

> Jun
>
> Thanks for your answers and the link to the paper - that helps a lot,
> especially the comment in the paper that 10 second end to end latency is
> good enough for your intended use case.
>
> We're looking for much lower latencies, and the basic design of Kafka feels
> like it should support latencies in milliseconds with a few changes. We're
> either going to build our own system, or help develop something that
> already
> exists, so please take my comments in the constructive way they're intended
> (I realize the changes I'm suggesting are outside your intended use case,
> but if you're interested we may be able to provide a very capable developer
> to help with the work, assuming we choose kafka over the other zillion
> streaming systems that are coming out of the woodwork).
>
> a. *Producer "queue.time"* - In my question 4 below, I was referring to the
> producer queue time. With a default value of 5 seconds, that accounts for
> half your end to end latency. A system like zeromq is optimized to write
> data immediately without delay, but in such a way as to minimize the number
> of system calls required at high message rates. Zeromq is no nirvana, but
> it has a number of nice properties.
>

One can reduce queue.time to reduce latency. Our producer api also allows a
user to send a message synchronously. This will increase the number of RPC
calls. Whether this becomes a problem depends on your data volume.
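
For illustration, the relevant knobs on the async producer would look
something like this (queue.time is the property discussed above; batch.size
is the companion size threshold; exact names and defaults depend on the
Kafka version you run):

  # max time a message may sit in the async producer's queue
  # before being sent to the broker (default 5000 ms)
  queue.time=100
  # max number of messages bundled into one send once the time
  # or size threshold fires
  batch.size=200

Shrinking queue.time bounds the producer-side latency at the cost of
smaller batches, and therefore more RPCs.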


>
> b. *Broker "log.default.flush.interval.ms"* - The default value of 3 seconds
> appears to be another significant source of latency in the system, assuming
> that clients are unable to access data until it has been flushed. Since you
> have wisely chosen to take advantage of the buffer cache as part of your
> system design, it seems that you could remove this latency completely by
> memory mapping the partitions and memcpying each message as it arrives. With
> the right IPC mechanism clients could have immediate access to new messages.
>
Currently, we only expose flushed messages to the consumer. I would like to
have a continuous flusher, which flushes dirty pages to disk asynchronously
as fast as it can. That way, a user doesn't have to set the flush intervals
manually. Also, in our replication design, we will allow a message to be
exposed to the consumer as soon as the message hits multiple brokers in
memory.
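
Concretely, the flush knobs in question look like this on the broker side
(a sketch only, using the property named above; the scheduler property name
may differ by version):

  # flush a log's accumulated messages to disk once they reach
  # this age (default 3000 ms); consumers only see flushed data
  log.default.flush.interval.ms=100
  # how often the background flusher checks logs for pending work
  log.default.flush.scheduler.interval.ms=100

Lowering these trades many small flushes for lower consumer-visible
latency.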


> c. *Batching, sync vs async, replication, and auditing* - It's
> understandable that you've chosen a forensic approach to producer
> reliability (after the fact auditing), but when you implement replication
> it would be really nice to revise the producer protocol mechanisms. If you
> used a streaming mechanism with producer offsets and ACKs, you could
> ensure reliable delivery of producer streams to multiple brokers without
> the need to choose a "batch size" or "queue.time". This could also give
> you active/active failover of brokers. This may also help in the WAN case
> (my question 3 below), because you will be able to adaptively stuff more
> and more data through the fiber on high bandwidth*delay links without
> having to choose a large "batch size" or incurring the additional latency
> that entails. Oh, and it will help you deal with CRC errors once you start
> checking for them.
>
We do plan to add producer side ACKs, likely as part of the replication
work. Our current replication design is intended for a cluster of machines
within the same data center. Across DCs, we still plan to use asynchronous
replication through embedded consumers. However, even with async
replication, I believe that we can achieve latency much better than 10 secs
by shrinking the buffering time, flush interval, etc. Since a round trip RPC
across DCs can be 100ms by itself, it would be hard to keep the latency at
the millisecond level. A couple of seconds delay should be achievable.
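
As a back-of-envelope illustration (numbers assumed, not measured), the
pieces of a tuned cross-DC pipeline might add up as:

  producer queue.time                ~100 ms
  broker flush interval              ~100 ms
  embedded consumer fetch + WAN RTT  ~100-300 ms
  -------------------------------------------
  end to end                         well under 1 sec

which is why a couple of seconds leaves comfortable headroom even with
async replication.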

> d. *Performance measurements* - I'd like to make a suggestion for your
> performance measurements. Your benchmarks measure throughput, but a
> throughput number is meaningless without an associated "% cpu time".
> Ideally all measurements achieve wire speed (100MB/sec) at 0% CPU (since,
> after all, this is plumbing and we assume the cores in the system should
> have capacity set aside for useful work too). Obviously nobody ever
> achieves this, but by measuring it one can raise the bar in terms of
> optimization.
>
That's a good suggestion. The CPU usage for Kafka is really low (typically
less than 10%) and our performance is really bound by disk I/O.
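
A minimal harness sketch for reporting the two numbers together (plain
Java; the loop body is a placeholder for whatever producer client is under
test, and load average is only a cheap in-JVM proxy for "% cpu time"):

  import java.lang.management.ManagementFactory;
  import java.lang.management.OperatingSystemMXBean;

  public class ThroughputWithCpu {
      public static void main(String[] args) {
          OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
          long bytes = 0;
          long start = System.nanoTime();
          for (int i = 0; i < 1000000; i++) {
              // placeholder: replace with producer.send(...) of a real message
              bytes += 1024; // pretend each message is 1KB
          }
          double secs = (System.nanoTime() - start) / 1e9;
          // getSystemLoadAverage() returns -1 on platforms without support
          System.out.printf("%.1f MB/s at load average %.2f%n",
                  bytes / 1e6 / secs, os.getSystemLoadAverage());
      }
  }

Sampling /proc/stat (or mpstat) alongside the run gives a true "% cpu"
figure per core.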


> Paul
>
> ps. Just for background, I am the cofounder at Quantcast where we process
> 3.5PB of data per day. These questions are related to my new startup
> Quantbench which will deal with financial market data where you don't want
> any latency at all. And WAN issues are a big deal too. Incidentally, I was
> also founder of Orbital Data which was a WAN optimization company so I've
> done a lot of work with protocols over long distances.
>
> On Mon, Jul 18, 2011 at 7:14 PM, Jun Rao <junrao@gmail.com> wrote:
>
> > Paul,
> >
> > Excellent questions. See my answers below. Thanks,
> >
> > On Mon, Jul 18, 2011 at 6:41 PM, Paul Sutter <psutter@quantbench.com>
> > wrote:
> >
> > > Kafka looks like an exciting project, thanks for opening it up.
> > >
> > > I have a few questions:
> > >
> > > 1. Are checksums end to end (ie, created by the producer and checked
> > > by the consumer)? or are they only used to confirm buffer cache
> > > behavior on disk as mentioned in the documentation? Bit errors occur
> > > vastly more often than most people assume, often because of device
> > > driver bugs. TCP's 16-bit checksum will miss roughly 1 error in 65536,
> > > so errors can flow through (if you like I can send links to papers
> > > describing the need for checksums everywhere).
> > >
> >
> > Checksum is generated at the producer and propagated to the broker and
> > eventually the consumer. Currently, we only validate the checksum at the
> > broker. We could further validate it at the consumer in the future.
> >
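
The end-to-end check being discussed can be expressed independently of any
wire format; a minimal sketch with java.util.zip.CRC32 (illustrative only,
not Kafka's actual message format):

  import java.util.zip.CRC32;

  public class EndToEndCrc {
      // producer side: checksum the payload and ship the value with it
      static long checksum(byte[] payload) {
          CRC32 crc = new CRC32();
          crc.update(payload);
          return crc.getValue();
      }

      public static void main(String[] args) {
          byte[] payload = "a message".getBytes();
          long expected = checksum(payload); // travels alongside the message
          // consumer side: recompute after delivery and compare
          if (checksum(payload) != expected) {
              throw new IllegalStateException("end-to-end corruption detected");
          }
      }
  }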
> > >
> > > 2. The consumer has a pretty solid mechanism to ensure it hasn't
> > > missed any messages (I like the design, by the way), but how does the
> > > producer know that all of its messages have been stored? (There is no
> > > apparent message id on that side, since the message id isn't known
> > > until the message is written to the file.) I'm especially curious how
> > > failover/replication could be implemented, and I'm thinking that acks
> > > on the publisher side may help.
> > >
> >
> > The producer side auditing is not built-in. At LinkedIn, we do that by
> > generating an auditing event periodically in the event handler of the
> > async producer. The auditing event contains the number of events
> > produced in a configured window (e.g., 10 minutes) and is sent to a
> > separate topic. The consumer can read the actual data and the auditing
> > event and compare the counts. See our paper (
> > http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
> > ) for some more details.
> >
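
The audit record itself can be tiny; a sketch of its shape (field names
hypothetical, the real one lives in LinkedIn's async producer event
handler):

  // emitted to a separate audit topic once per window (e.g. 10 minutes)
  public class AuditEvent {
      final String topic;       // data topic being audited
      final long windowStartMs; // start of the counting window
      final long count;         // messages produced during the window

      AuditEvent(String topic, long windowStartMs, long count) {
          this.topic = topic;
          this.windowStartMs = windowStartMs;
          this.count = count;
      }
  }

A consumer tails both the data topic and the audit topic, counts data
messages per window, and flags any window where the counts disagree.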
> >
> > >
> > > 3. Has the consumer's flow control been tested over high
> > > bandwidth*delay links? (what bandwidth can you get from a London
> > > consumer of an SF cluster?)
> > >
> > Yes, we actually replicate kafka data across data centers, using an
> > embedded consumer in a broker. Again, there is a bit more info on this
> > in our paper.
> >
> >
> > > 4. What kind of performance do you get if you set the producer's
> > > message delay to zero? (ie, is there a separate system call for each
> > > message? or do you manage to aggregate messages into a smaller number
> > > of system calls even with a delay of 0?)
> > >
> > I assume that you are referring to the flush interval. One can
> > configure the broker to flush every message to disk. This will slow
> > down the throughput significantly.
> >
> >
> > > 5. Have you considered using a library like zeromq for the messaging
> > > layer instead of rolling your own? (zeromq will handle #4 cleanly at
> > > millions of messages per second and has clients in 20 languages)
> > >
> > No. Our proprietary format allows us to support things like compression
> > in the future. However, we can definitely look into the zeromq format.
> > Is their messaging layer easily extractable?
> >
> >
> > > 6. Do you have any plans to support intermediate processing elements
> > > the way Flume supports?
> > >
> > For now, we are just focusing on getting the raw messaging layer solid.
> > We have worked a bit on stream processing and will look into that again
> > in the future.
> >
> >
> > > 7. The docs mention that new versions will only be released after
> > > they are in production at LinkedIn. Does that mean that the latest
> > > version of the source code is hidden at LinkedIn and contributors
> > > would have to throw patches over the wall and wait months to get the
> > > integrated product?
> > >
> > What we run at LinkedIn is the same version as the open source one, and
> > there is no internal repository of Kafka at LinkedIn. We plan to
> > maintain that in the future.
> >
> >
> > > Thanks!
> > >
> >
>
