kafka-users mailing list archives

From Alexis Richardson <alexis.richard...@gmail.com>
Subject Re: Arguments for Kafka over RabbitMQ ?
Date Sat, 08 Jun 2013 20:09:05 GMT
Jonathan

I am aware of the difference between sequential writes and other kinds
of writes ;p)

AFAIK the Kafka docs describe a sort of platonic alternative system,
eg "normally people do this.. Kafka does that..".  This is a good way
to explain design decisions.  However, I think you may be assuming
that Rabbit is a lot like the generalised other system.  But it is not
- eg Rabbit does not do lots of random IO.  I'm led to understand that
Rabbit's msg store is closer to log structured storage (a la
Log-Structured Merge Trees) in some ways.  However, Rabbit does do
more synchronous I/O, and has a different caching strategy (AFAIK).
"It's complicated"

In order to help provide useful info to the community, please could
you describe a concrete test that we could discuss?  I think that
would really help.  You mentioned a scenario with one large data set
being streamed into the broker(s), and then consumed (in full?) by 2+
consumers of wildly varying speeds.  Could you elaborate please?

alexis


Also, this is probably OT but I have never grokked this in the Design Doc:

"Consumer rebalancing is triggered on each addition or removal of both
broker nodes and other consumers within the same group. For a given
topic and a given consumer group, broker partitions are divided evenly
among consumers within the group."

When a new consumer and/or partition appears, can messages in the
broker get "moved" from one partition to another?
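
For what it's worth, here is a minimal sketch of the "divided evenly" idea in that quote. The function name, the partition labels and the round-robin rule are assumptions made for illustration, not Kafka's actual assignment algorithm. What it shows is that a rebalance changes which consumer owns which partition; the set of partitions, and whatever was already written to them, stays the same.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers as evenly as possible (hypothetical rule)."""
    ordered = sorted(consumers)
    if not ordered:
        raise ValueError("need at least one consumer")
    assignment = {c: [] for c in ordered}
    # Deal partitions out round-robin so group sizes differ by at most one.
    for i, p in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(p)
    return assignment


partitions = ["broker1-0", "broker1-1", "broker2-0", "broker2-1"]
before = assign_partitions(partitions, ["c1", "c2"])
after = assign_partitions(partitions, ["c1", "c2", "c3"])  # c3 joins the group
print(before)
print(after)
```

Only ownership changes between `before` and `after`; nothing in the sketch touches the messages themselves.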


On Sat, Jun 8, 2013 at 12:53 PM, Jonathan Hodges <hodgesz@gmail.com> wrote:
> On Sat, Jun 8, 2013 at 2:09 AM, Jonathan Hodges <hodgesz@gmail.com> wrote:
>> Thanks so much for your replies.  This has been a great help understanding
>> Rabbit better with having very little experience with it.  I have a few
>> follow up comments below.
>
> Happy to help!
>
> I'm afraid I don't follow your arguments below.  Rabbit contains many
> optimisations too.  I'm told that it is possible to saturate the disk
> i/o, and you saw the message rates I quoted in the previous email.
> YES of course there are differences, mostly an accumulation of things.
>  For example Rabbit spends more time doing work before it writes to
> disk.
>
> It would be great if you could detail some of the optimizations.  It
> would seem to me Rabbit has much more overhead due to maintaining state of
> the consumers as well as general message processing, which makes it
> impossible to manage the same write throughput as Kafka when you need to
> persist large amounts of data to disk.  I definitely believe you that
> Rabbit can saturate the disk but it is much more seek centric, i.e. random
> access reads/writes vs sequential reads/writes.  Kafka saturates the disk
> too, but since it leverages sequential disk I/O it is orders of magnitude
> more efficient persisting to disk than random access.
>
>
> You said:
>
> "Since Rabbit must maintain the state of the
> consumers I imagine it’s subjected to random data access patterns on disk
> as opposed to sequential."
>
> I don't follow the logic here, sorry.
>
> Couple of side comments:
>
> * In your Hadoop vs RT example, Rabbit would deliver the RT messages
> immediately and write the rest to disk.  It can do this at high rates
> - I shall try to get you some useful data here.
>
> * Bear in mind that write speed should be orthogonal to read speed.
> Ask yourself - how would Kafka provide a read cache, and when might
> that be useful?
>
> * I'll find out what data structure Rabbit uses for long term persistence.
>
> What I am saying here is that when Rabbit needs to retrieve and persist each
> consumer’s state from its internal DB, this information isn’t linearly
> persisted on disk, so it requires disk seeks, which are much less
> efficient than sequential access.  You do get the difference here,
> correct?  Sequential reads from disk are nearly 1.5x faster than random
> reads from memory and 4-5 orders of magnitude faster than random reads from
> disk (http://queue.acm.org/detail.cfm?id=1563874).
>
> As was detailed at length in my previous post, Kafka uses the OS
> pagecache/sendfile, which is much more efficient than an in-memory or
> application-level cache.
>
> That would be awesome if you can confirm what Rabbit is using as a
> persistent data structure.  More importantly, whether it is BTree or
> something else, is the disk i/o random or linear?
>
>
> "Quoting the Kafka design page (
> http://kafka.apache.org/07/design.html) performance of sequential writes on
> a 6 7200rpm SATA RAID-5 array is about 300MB/sec but the performance of
> random writes is only about 50k/sec—a difference of nearly 10000X."
>
> Depending on your use case, I'd expect 2x-10x overall throughput
> differences, and will try to find out more info.  As I said, Rabbit
> can saturate disk i/o.
>
> This is only speaking of the use case of high throughput with persisting
> large amounts of data to disk, where the difference is 4 orders of
> magnitude rather than 10x.  It all comes down to random vs sequential
> writes/reads to disk as I mentioned above.
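
As a quick sanity check on the figures quoted from the design page, the ratio can be worked out directly. This is only a sketch: it assumes "50k/sec" means 50 KB/sec and that both figures use decimal units, neither of which the quote states.

```python
# Ratio between the sequential and random write rates quoted above.
SEQ_WRITE_BYTES_PER_SEC = 300 * 10**6   # "about 300MB/sec"
RAND_WRITE_BYTES_PER_SEC = 50 * 10**3   # "about 50k/sec", read here as 50 KB/sec

ratio = SEQ_WRITE_BYTES_PER_SEC / RAND_WRITE_BYTES_PER_SEC
print(f"sequential / random = {ratio:,.0f}x")
```

Under those assumptions the ratio is 6,000x, which the design page rounds up to "nearly 10000X".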
>
>
> On Sat, Jun 8, 2013 at 2:07 AM, Alexis Richardson <
> alexis.richardson@gmail.com> wrote:
>
>> Jonathan
>>
>> On Sat, Jun 8, 2013 at 2:09 AM, Jonathan Hodges <hodgesz@gmail.com> wrote:
>> > Thanks so much for your replies.  This has been a great help
>> > understanding Rabbit better with having very little experience with it.
>> > I have a few follow up comments below.
>>
>> Happy to help!
>>
>> I'm afraid I don't follow your arguments below.  Rabbit contains many
>> optimisations too.  I'm told that it is possible to saturate the disk
>> i/o, and you saw the message rates I quoted in the previous email.
>> YES of course there are differences, mostly an accumulation of things.
>>  For example Rabbit spends more time doing work before it writes to
>> disk.
>>
>> You said:
>>
>> "Since Rabbit must maintain the state of the
>> consumers I imagine it’s subjected to random data access patterns on disk
>> as opposed to sequential."
>>
>> I don't follow the logic here, sorry.
>>
>> Couple of side comments:
>>
>> * In your Hadoop vs RT example, Rabbit would deliver the RT messages
>> immediately and write the rest to disk.  It can do this at high rates
>> - I shall try to get you some useful data here.
>>
>> * Bear in mind that write speed should be orthogonal to read speed.
>> Ask yourself - how would Kafka provide a read cache, and when might
>> that be useful?
>>
>> * I'll find out what data structure Rabbit uses for long term persistence.
>>
>>
>> "Quoting the Kafka design page (
>> http://kafka.apache.org/07/design.html) performance of sequential writes
>> on a 6 7200rpm SATA RAID-5 array is about 300MB/sec but the performance of
>> random writes is only about 50k/sec—a difference of nearly 10000X."
>>
>> Depending on your use case, I'd expect 2x-10x overall throughput
>> differences, and will try to find out more info.  As I said, Rabbit
>> can saturate disk i/o.
>>
>> alexis
>>
>>
>>
>>
>> >
>> >> While you are correct the payload is a much bigger concern, managing the
>> >> metadata and acks centrally on the broker across multiple clients at
>> >> scale is also a concern.  This would seem to be exacerbated if you have
>> >> consumers at different speeds, i.e. Storm and Hadoop consuming the same
>> >> topic.
>> >>
>> >> In that scenario, say Storm consumes the topic messages in real-time and
>> >> Hadoop consumes once a day.  Let’s assume the topic consists of 100k+
>> >> messages/sec throughput so that in a given day you might have 100s of GBs
>> >> of data flowing through the topic.
>> >>
>> >> To allow Hadoop to consume once a day, Rabbit obviously can’t keep 100s of
>> >> GBs in memory and will need to persist this data to its internal DB to be
>> >> retrieved later.
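
To put a rough number on "100s of GBs", here is the arithmetic behind that scenario. The message size is a hypothetical value chosen for illustration; the thread gives the rate but not the size.

```python
MESSAGES_PER_SEC = 100_000          # "100k+ messages/sec" from the scenario above
SECONDS_PER_DAY = 24 * 60 * 60
ASSUMED_BYTES_PER_MESSAGE = 100     # hypothetical; not stated in the thread

messages_per_day = MESSAGES_PER_SEC * SECONDS_PER_DAY
gigabytes_per_day = messages_per_day * ASSUMED_BYTES_PER_MESSAGE / 10**9
print(f"{messages_per_day:,} messages/day, about {gigabytes_per_day:,.0f} GB/day")
```

At 100 bytes per message that is 8.64 billion messages and 864 GB a day, so the "100s of GBs" estimate holds for quite small messages.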
>> >
>> > I am not sure why you think this is a problem?
>> >
>> > For a fixed number of producers and consumers, the pubsub and delivery
>> > semantics of Rabbit and Kafka are quite similar.  Think of Rabbit as
>> > adding an in-memory cache that is used to (a) speed up read
>> > consumption, (b) obviate disk writes when possible due to all client
>> > consumers being available and consuming.
>> >
>> >
>> > Actually I think this is the main use case that sets Kafka apart from
>> > Rabbit and speaks to the poster’s ‘Arguments for Kafka over RabbitMQ’
>> > question.  As you mentioned Rabbit is a general purpose messaging system
>> > and along with that has a lot of features not found in Kafka.  There are
>> > plenty of times when Rabbit makes more sense than Kafka, but not when you
>> > are maintaining large message stores and require high throughput to disk.
>> >
>> > Persisting 100s GBs of messages to disk is a much different problem than
>> > managing messages in memory.  Since Rabbit must maintain the state of the
>> > consumers I imagine it’s subjected to random data access patterns on disk
>> > as opposed to sequential.  Quoting the Kafka design page (
>> > http://kafka.apache.org/07/design.html) performance of sequential writes
>> > on a 6 7200rpm SATA RAID-5 array is about 300MB/sec but the performance of
>> > random writes is only about 50k/sec—a difference of nearly 10000X.
>> >
>> > They go on to say the persistent data structure used for messaging system
>> > metadata is often a BTree. BTrees are the most versatile data structure
>> > available, and make it possible to support a wide variety of
>> > transactional and non-transactional semantics in the messaging system.
>> > They do come with a fairly high cost, though: BTree operations are
>> > O(log N). Normally O(log N) is considered essentially equivalent to
>> > constant time, but this is not true for disk operations. Disk seeks come
>> > at 10 ms a pop, and each disk can do only one seek at a time so
>> > parallelism is limited. Hence even a handful of disk seeks leads to very
>> > high overhead. Since storage systems mix very fast cached operations with
>> > actual physical disk operations, the observed performance of tree
>> > structures is often superlinear. Furthermore BTrees require a very
>> > sophisticated page or row locking implementation to avoid locking the
>> > entire tree on each operation. The implementation must pay a fairly high
>> > price for row-locking or else effectively serialize all reads. Because of
>> > the heavy reliance on disk seeks it is not possible to effectively take
>> > advantage of the improvements in drive density, and one is forced to use
>> > small (< 100GB) high RPM SAS drives to maintain a sane ratio of data to
>> > seek capacity.
>> >
>> > Intuitively a persistent queue could be built on simple reads and appends
>> > to files as is commonly the case with logging solutions. Though this
>> > structure would not support the rich semantics of a BTree implementation,
>> > it has the advantage that all operations are O(1) and reads do not
>> > block writes or each other. This has obvious performance advantages since
>> > the performance is completely decoupled from the data size--one server
>> > can now take full advantage of a number of cheap, low-rotational speed
>> > 1+TB SATA drives. Though they have poor seek performance, these drives
>> > often have comparable performance for large reads and writes at 1/3 the
>> > price and 3x the capacity.
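
The "simple reads and appends to files" idea can be sketched roughly as below. The class name, the 4-byte length prefix and the use of byte offsets as positions are assumptions made for this illustration; this is not Kafka's on-disk format. The points it tries to show are that an append or a read at a known offset does not depend on how much data the log holds, and that each consumer can keep its own position without the log storing any consumer state.

```python
import os
import struct
import tempfile


class AppendOnlyLog:
    """Length-prefixed records appended to a single file (illustrative only)."""

    def __init__(self, path):
        # "a+b": every write goes to the end of the file; reads may seek anywhere.
        self._file = open(path, "a+b")

    def append(self, payload: bytes) -> int:
        """Write one record at the end of the file and return its byte offset."""
        self._file.seek(0, os.SEEK_END)
        offset = self._file.tell()
        self._file.write(struct.pack(">I", len(payload)) + payload)
        self._file.flush()
        return offset

    def read(self, offset: int):
        """Return (payload, next_offset) for the record starting at offset."""
        self._file.seek(offset)
        (length,) = struct.unpack(">I", self._file.read(4))
        return self._file.read(length), offset + 4 + length

    def close(self):
        self._file.close()


path = os.path.join(tempfile.mkdtemp(), "topic.log")
log = AppendOnlyLog(path)
first = log.append(b"page_view:1")
log.append(b"page_view:2")

# A fast and a slow consumer each track their own offset into the same log.
fast_offset = slow_offset = first
msg, fast_offset = log.read(fast_offset)
msg2, fast_offset = log.read(fast_offset)
slow_msg, slow_offset = log.read(slow_offset)
log.close()
```

The slow consumer can come back a day later and carry on from `slow_offset`, as long as the data is still retained.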
>> >
>> > Having access to virtually unlimited disk space without penalty means
>> > that we can provide some features not usually found in a messaging system.
>> > For example, in Kafka, instead of deleting a message immediately after
>> > consumption, we can retain messages for a relatively long period (say a
>> > week).
>> >
>> > Our assumption is that the volume of messages is extremely high, indeed
>> > it is some multiple of the total number of page views for the site (since
>> > a page view is one of the activities we process). Furthermore we assume
>> > each message published is read at least once (and often multiple times),
>> > hence we optimize for consumption rather than production.
>> >
>> > There are two common causes of inefficiency: too many network requests,
>> > and excessive byte copying.
>> >
>> > To encourage efficiency, the APIs are built around a "message set"
>> > abstraction that naturally groups messages. This allows network requests
>> > to group messages together and amortize the overhead of the network
>> > roundtrip rather than sending a single message at a time.
>> >
>> > The MessageSet implementation is itself a very thin API that wraps a byte
>> > array or file. Hence there is no separate serialization or
>> > deserialization step required for message processing; message fields are
>> > lazily deserialized as needed (or not deserialized if not needed).
>> >
>> > The message log maintained by the broker is itself just a directory of
>> > message sets that have been written to disk. This abstraction allows a
>> > single byte format to be shared by both the broker and the consumer (and
>> > to some degree the producer, though producer messages are checksummed and
>> > validated before being added to the log).
>> >
>> > Maintaining this common format allows optimization of the most important
>> > operation: network transfer of persistent log chunks. Modern unix
>> > operating systems offer a highly optimized code path for transferring data
>> > out of pagecache to a socket; in Linux this is done with the sendfile
>> > system call. Java provides access to this system call with the
>> > FileChannel.transferTo api.
>> >
>> > To understand the impact of sendfile, it is important to understand the
>> > common data path for transfer of data from file to socket:
>> >
>> >   1. The operating system reads data from the disk into pagecache in
>> >      kernel space
>> >   2. The application reads the data from kernel space into a user-space
>> >      buffer
>> >   3. The application writes the data back into kernel space into a socket
>> >      buffer
>> >   4. The operating system copies the data from the socket buffer to the
>> >      NIC buffer where it is sent over the network
>> >
>> > This is clearly inefficient: there are four copies and two system calls.
>> > Using sendfile, this re-copying is avoided by allowing the OS to send the
>> > data from pagecache to the network directly. So in this optimized path,
>> > only the final copy to the NIC buffer is needed.
>> >
>> > We expect a common use case to be multiple consumers on a topic. Using
>> > the zero-copy optimization above, data is copied into pagecache exactly
>> > once and reused on each consumption instead of being stored in memory and
>> > copied out to kernel space every time it is read. This allows messages to
>> > be consumed at a rate that approaches the limit of the network connection.
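
For readers who want to see the sendfile path in code, here is a small sketch using Python's `socket.sendfile()`, which uses `os.sendfile` where the platform provides it and otherwise falls back to plain `send()`. The helper name `send_log_chunk`, the temporary file and the local socket pair are assumptions for the demo; Kafka itself goes through Java's FileChannel.transferTo, as the quoted text says.

```python
import os
import socket
import tempfile


def send_log_chunk(sock, path, offset=0, count=None):
    """Send bytes from a file to a connected socket via socket.sendfile().

    Hypothetical helper: the application never reads the file contents into
    its own buffer when the OS-level sendfile path is available.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f, offset=offset, count=count)


# Demo on a local socket pair standing in for broker -> consumer.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"message-set-bytes")
    log_path = tmp.name

broker_side, consumer_side = socket.socketpair()
sent = send_log_chunk(broker_side, log_path)
broker_side.close()
received = consumer_side.recv(1024)
consumer_side.close()
os.remove(log_path)
```

Whether the kernel really avoids the intermediate copies depends on the OS and socket type, so treat this as showing the API shape rather than a benchmark.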
>> >
>> >
>> > So in the end it would seem Kafka’s specialized nature to write data
>> > first really shines over Rabbit when your use case requires a very high
>> > throughput, non-blocking firehose with large data persistence to disk.
>> > Since this is only one use case this by no means is saying Kafka is better
>> > than Rabbit or vice versa.  I think it is awesome there are more options
>> > to choose from so you can pick the right tool for the job.  Thanks open
>> > source!
>> >
>> > As always YMMV.
>> >
>> >
>> >
>> > On Fri, Jun 7, 2013 at 4:40 PM, Alexis Richardson <
>> > alexis.richardson@gmail.com> wrote:
>> >
>> >> Jonathan,
>> >>
>> >>
>> >> On Fri, Jun 7, 2013 at 7:03 PM, Jonathan Hodges <hodgesz@gmail.com> wrote:
>> >> > Hi Alexis,
>> >> >
>> >> > I appreciate your reply and clarifications to my misconception about
>> >> > Rabbit, particularly on the copying of the message payloads per
>> >> > consumer.
>> >>
>> >> Thank-you!
>> >>
>> >>
>> >> >  It sounds like it only copies metadata like the consumer state i.e.
>> >> > position in the topic messages.
>> >>
>> >> Basically yes.  Of course when a message is delivered to N>1
>> >> *machines*, then there will be N copies, one per machine.
>> >>
>> >> Also, for various reasons, very tiny (<60b) messages do get copied as
>> >> you'd assumed.
>> >>
>> >>
>> >> > I don’t have experience with Rabbit and
>> >> > was basing this assumption on Google searches like the following:
>> >> > http://ilearnstack.com/2013/04/16/introduction-to-amqp-messaging-with-rabbitmq/
>> >> > It seems to indicate with topic exchanges that the messages get
>> >> > copied to a queue per consumer, but I am glad you confirmed it is just
>> >> > the metadata.
>> >>
>> >> Yup.
>> >>
>> >> That's a fairly decent article but even the good stuff uses words like
>> >> "copy" without a fixed denotation.  Don't believe the internets!
>> >>
>> >>
>> >> > While you are correct the payload is a much bigger concern, managing
>> >> > the metadata and acks centrally on the broker across multiple clients
>> >> > at scale is also a concern.  This would seem to be exacerbated if you
>> >> > have consumers at different speeds, i.e. Storm and Hadoop consuming the
>> >> > same topic.
>> >> >
>> >> > In that scenario, say Storm consumes the topic messages in real-time
>> >> > and Hadoop consumes once a day.  Let’s assume the topic consists of
>> >> > 100k+ messages/sec throughput so that in a given day you might have
>> >> > 100s of GBs of data flowing through the topic.
>> >> >
>> >> > To allow Hadoop to consume once a day, Rabbit obviously can’t keep
>> >> > 100s of GBs in memory and will need to persist this data to its
>> >> > internal DB to be retrieved later.
>> >>
>> >> I am not sure why you think this is a problem?
>> >>
>> >> For a fixed number of producers and consumers, the pubsub and delivery
>> >> semantics of Rabbit and Kafka are quite similar.  Think of Rabbit as
>> >> adding an in-memory cache that is used to (a) speed up read
>> >> consumption, (b) obviate disk writes when possible due to all client
>> >> consumers being available and consuming.
>> >>
>> >>
>> >> > I believe the scenario where large amounts of data need to be persisted
>> >> > is the one described in the earlier posted Kafka paper (
>> >> > http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
>> >> > ) where Rabbit’s performance really starts to bog down as compared to
>> >> > Kafka.
>> >>
>> >> Not sure what parts of the paper you mean?
>> >>
>> >> I read that paper when it came out.  I found it strongest when
>> >> describing Kafka's design philosophy.  I found the performance
>> >> statements made about Rabbit pretty hard to understand.  This is not
>> >> meant to be a criticism of the authors!  I have seen very few
>> >> performance papers about messaging that I would base decisions on.
>> >>
>> >>
>> >> > This Kafka paper looks to be a few years old
>> >>
>> >> Um....  Lots can change in technology very quickly :-)
>> >>
>> >> Eg.: At the time this paper was published, Instagram had 5m users.
>> >> Six months earlier in Dec 2010, it had 1m.  Since then it grew huge
>> >> and got acquired.
>> >>
>> >>
>> >>
>> >> > so has something changed
>> >> > within the Rabbit architecture to alleviate this issue when large
>> amounts
>> >> > of data are persisted to the internal DB?
>> >>
>> >> Rabbit introduced a new internal flow control system which impacted
>> >> performance under steady load.  This may be relevant?  I couldn't say
>> >> from reading the paper.
>> >>
>> >> I don't have a good reference for this to hand, but here is a post
>> >> about external flow control that you may find amusing:
>> >>
>> >>
>> >> http://www.rabbitmq.com/blog/2012/05/11/some-queuing-theory-throughput-latency-and-bandwidth/
>> >>
>> >>
>> >> > Do the producer and consumer
>> >> > numbers look correct?  If not, maybe you can share some Rabbit
>> >> > benchmarks under this scenario, because I believe it is the main area
>> >> > where Kafka appears to be the superior solution.
>> >>
>> >> This is from about one year ago:
>> >>
>> >>
>> >> http://www.rabbitmq.com/blog/2012/04/25/rabbitmq-performance-measurements-part-2/
>> >>
>> >> Obviously none of this uses batching, which is an easy trick for
>> >> increasing throughput.
>> >>
>> >> YMMV.
>> >>
>> >> Is this helping?
>> >>
>> >> alexis
>> >>
>> >>
>> >>
>> >> > Thanks for educating me on these matters.
>> >> >
>> >> > -Jonathan
>> >> >
>> >> >
>> >> >
>> >> > On Fri, Jun 7, 2013 at 6:54 AM, Alexis Richardson <alexis@rabbitmq.com>
>> >> > wrote:
>> >> >
>> >> >> Hi
>> >> >>
>> >> >> Alexis from Rabbit here.  I hope I am not intruding!
>> >> >>
>> >> >> It would be super helpful if people with questions, observations or
>> >> >> moans posted them to the rabbitmq list too :-)
>> >> >>
>> >> >> A few comments:
>> >> >>
>> >> >> * Along with ZeroMQ, I consider Kafka to be one of the interesting and
>> >> >> useful messaging projects out there.  In a world of cruft, Kafka is
>> >> >> cool!
>> >> >>
>> >> >> * This is because both projects come at messaging from a specific
>> >> >> point of view that is *different* from Rabbit.  OTOH, many other
>> >> >> projects exist that replicate Rabbit features for fun, or NIH, or due
>> >> >> to misunderstanding the semantics (yes, our docs could be better)
>> >> >>
>> >> >> * It is striking how few people describe those differences.  In a
>> >> >> nutshell they are as follows:
>> >> >>
>> >> >> *** Kafka writes all incoming data to disk immediately, and then
>> >> >> figures out who sees what.  So it is much more like a database than
>> >> >> Rabbit, in that new consumers can appear well after the disk write and
>> >> >> still subscribe to past messages.  Rabbit instead tries to
>> >> >> deliver to consumers and buffers otherwise.  Persistence is optional
>> >> >> but robust and a feature of the buffer ("queue") not the upstream
>> >> >> machinery.  Rabbit is able to cache-on-arrival via a plugin, but this
>> >> >> is a design overlay and not particularly optimal.
>> >> >>
>> >> >> *** Kafka is a client server system with end to end semantics.  It
>> >> >> defines order to include processing order, and keeps state on the
>> >> >> client to do this.  Group management is via a 3rd party service
>> >> >> (Zookeeper? I forget which).  Rabbit is a server-only protocol based
>> >> >> system which maintains order on the server and through completely
>> >> >> language neutral protocol semantics.  This makes Rabbit perhaps more
>> >> >> natural as a 'messaging service' eg for integration and other
>> >> >> inter-app data transfer.
>> >> >>
>> >> >> *** Rabbit is a general purpose messaging system with extras like
>> >> >> federation.  It speaks many protocols, and has core features like HA,
>> >> >> transactions, management, etc.  Everything can be switched on or off.
>> >> >> Getting all this to work while keeping the install light and fast is
>> >> >> quite fiddly.  Kafka by contrast comes from a specific set of use
>> >> >> cases, which are interesting certainly.  I am not sure if Kafka wants
>> >> >> to be a general purpose messaging system, but it will become a bit
>> >> >> more like Rabbit if that is the goal.
>> >> >>
>> >> >> *** Both approaches have costs.  In the case of Rabbit the cost is
>> >> >> that more metadata is stored on the broker.  Kafka can get performance
>> >> >> gains by storing less such data.  But we are talking about some N
>> >> >> thousands of MPS versus some M thousands.  At those speeds the clients
>> >> >> are usually the bottleneck anyway.
>> >> >>
>> >> >> * Let me also clarify some things:
>> >> >>
>> >> >> *** Rabbit does NOT store multiple copies of the same message across
>> >> >> queues, unless they are very small (<60b, iirc).  A message delivered
>> >> >> to >1 queue on 1 machine is stored once.  Metadata about that message
>> >> >> may be stored more than once, but, at scale, the big cost is the
>> >> >> payload.
>> >> >>
>> >> >> *** Rabbit's vanilla install does store some index data in memory when
>> >> >> messages flow to disk.  You can change this by using a plugin, but
>> >> >> this is a secret-menu undocumented feature.  Very very few people need
>> >> >> any such thing.
>> >> >>
>> >> >> *** A Rabbit queue is lightweight.  It's just an ordered consumption
>> >> >> buffer that can persist and ack.  Don't assume things about Rabbit
>> >> >> queues based on what you know about IBM MQ, JMS, and so forth.  Queues
>> >> >> in Rabbit and Kafka are not the same.
>> >> >>
>> >> >> *** Rabbit does not use mnesia for message storage.  It has its own
>> >> >> DB, optimised for messaging.  You can use other DBs but this is
>> >> >> Complicated.
>> >> >>
>> >> >> *** Rabbit does all kinds of batching and bulk processing, and can
>> >> >> batch end to end.  If you see claims about batching, buffering, etc.,
>> >> >> find out ALL the details before drawing conclusions.
>> >> >>
>> >> >> I hope this is helpful.
>> >> >>
>> >> >> Keen to get feedback / questions / corrections.
>> >> >>
>> >> >> alexis
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Fri, Jun 7, 2013 at 2:09 AM, Marc Labbe <mrlabbe@gmail.com> wrote:
>> >> >> > We also went through the same decision making and our arguments for
>> >> >> > Kafka were along the same lines as those Jonathan mentioned. The fact
>> >> >> > that we have heterogeneous consumers is really a deciding factor. Our
>> >> >> > requirements were to avoid losing messages at all costs while having
>> >> >> > multiple consumers reading the same data at a different pace. On one
>> >> >> > side, we have a few consumers being fed with data coming in from
>> >> >> > most, if not all, topics. On the other side, we have a good bunch of
>> >> >> > consumers reading only from a single topic. The big guys can take
>> >> >> > their time to read while the smaller ones are mostly for near
>> >> >> > real-time events so they need to keep up the pace of incoming
>> >> >> > messages.
>> >> >> >
>> >> >> > RabbitMQ stores data on disk only if you tell it to while Kafka
>> >> >> > persists by design. From the beginning, we decided we would try to
>> >> >> > use the queues the same way, pub/sub with a routing key (an exchange
>> >> >> > in RabbitMQ) or topic, persisted to disk and replicated.
>> >> >> >
>> >> >> > One of our scenarios was to see how the system would cope with the
>> >> >> > largest consumer down for a while, therefore forcing the brokers to
>> >> >> > keep the data for a long period. In the case of RabbitMQ, this
>> >> >> > consumer has its own queue and data grows on disk, which is not
>> >> >> > really a problem if you plan accordingly. But, since it has to keep
>> >> >> > track of all messages read, the Mnesia database used by RabbitMQ as
>> >> >> > the messages index also grows pretty big. At that point, the amount
>> >> >> > of RAM necessary becomes very large to keep the level of performance
>> >> >> > we need. In our tests, we found that this had an adverse effect on
>> >> >> > ALL the brokers, thus affecting all consumers. You can always say
>> >> >> > that you'll monitor the consumers to make sure it won't happen.
>> >> >> > That's a good thing if you can. I wasn't ready to make that bet.
>> >> >> >
>> >> >> > Another point is the fact that, since we wanted to use pub/sub with
>> >> >> > an exchange in RabbitMQ, we would have ended up with a lot of data
>> >> >> > duplication because if a message is read by multiple consumers, it
>> >> >> > will get duplicated in the queue of each of those consumers. Kafka
>> >> >> > wins on that side too since every consumer reads from the same
>> >> >> > source.
>> >> >> >
>> >> >> > The downsides of Kafka were the language issues (we are using mostly
>> >> >> > Python and C#). 0.8 is very new and few drivers are available at this
>> >> >> > point. Also, we will have to try getting as close as possible to a
>> >> >> > once-and-only-once guarantee. These are two things where RabbitMQ
>> >> >> > would have given us less work out of the box as opposed to Kafka.
>> >> >> > RabbitMQ also provides a bunch of tools that make it rather
>> >> >> > attractive too.
>> >> >> >
>> >> >> > In the end, looking at throughput is a pretty nifty thing but being
>> >> >> > sure that I'll be able to manage the beast as it grows will allow me
>> >> >> > to get to sleep way more easily.
>> >> >> >
>> >> >> >
>> >> >> > On Thu, Jun 6, 2013 at 3:28 PM, Jonathan Hodges <hodgesz@gmail.com>
>> >> >> > wrote:
>> >> >> >
>> >> >> >> We just went through a similar exercise with RabbitMQ at our company
>> >> >> >> with streaming activity data from our various web properties.  Our
>> >> >> >> use case requires consumption of this stream by many heterogeneous
>> >> >> >> consumers including batch (Hadoop) and real-time (Storm).  We pointed
>> >> >> >> out that Kafka acts as a configurable rolling window of time on the
>> >> >> >> activity stream.  The window default is 7 days which allows for
>> >> >> >> supporting clients of different latencies like Hadoop and Storm to
>> >> >> >> read from the same stream.
>> >> >> >>
>> >> >> >> We pointed out that the Kafka brokers don't need to maintain consumer
>> >> >> >> state in the stream and only have to maintain one copy of the stream
>> >> >> >> to support N number of consumers.  Rabbit brokers on the other hand
>> >> >> >> have to maintain the state of each consumer as well as create a copy
>> >> >> >> of the stream for each consumer.  In our scenario we have 10-20
>> >> >> >> consumers and with the scale and throughput of the activity stream we
>> >> >> >> were able to show Rabbit quickly becomes the bottleneck under load.
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> On Thu, Jun 6, 2013 at 12:40 PM, Dragos Manolescu <
>> >> >> >> Dragos.Manolescu@servicenow.com> wrote:
>> >> >> >>
>> >> >> >> > Hi --
>> >> >> >> >
>> >> >> >> > I am preparing to make a case for using Kafka instead of Rabbit MQ
>> >> >> >> > as a broker-based messaging provider. The context is similar to that
>> >> >> >> > of the Kafka papers and user stories: the producers publish
>> >> >> >> > monitoring data and logs, and a suite of subscribers consume this
>> >> >> >> > data (some store it, others perform computations on the event
>> >> >> >> > stream). The requirements are typical of this context: low-latency,
>> >> >> >> > high-throughput, ability to deal with bursts and operate in/across
>> >> >> >> > multiple data centers, etc.
>> >> >> >> >
>> >> >> >> > I am familiar with the performance comparison between Kafka, Rabbit
>> >> >> >> > MQ and Active MQ from the NetDB 2011 paper
>> >> >> >> > <http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf>.
>> >> >> >> > However in the two years that passed since then the number of
>> >> >> >> > production Kafka installations increased, and people are using it in
>> >> >> >> > different ways than those imagined by Kafka's designers. In light of
>> >> >> >> > these experiences one can use more data points and color when
>> >> >> >> > contrasting to Rabbit MQ (which by the way also evolved since 2011).
>> >> >> >> > (And FWIW I know I am not the first one to walk this path; see for
>> >> >> >> > example last year's OSCON session on the State of MQ
>> >> >> >> > <http://lanyrd.com/2012/oscon/swrcz/>.)
>> >> >> >> >
>> >> >> >> > I would appreciate it if you could share measurements, results, or
>> >> >> >> > even anecdotal evidence along these lines. How have you avoided the
>> >> >> >> > "let's use Rabbit MQ because everybody else does it" route when
>> >> >> >> > solving problems for which Kafka is a better fit?
>> >> >> >> >
>> >> >> >> > Thanks,
>> >> >> >> >
>> >> >> >> > -Dragos
>> >> >> >> >
>> >> >> >>
>> >> >>
>> >>
>>
