kafka-users mailing list archives

From Edward Smith <esm...@stardotstar.org>
Subject Re: Embedding a broker into a producer?
Date Thu, 12 Apr 2012 16:47:32 GMT

  Thanks for sharing your architecture.  We are in a similar boat: our
current data stream is written to files first, and then the Kafka
producer reads and transmits those files.

What do you see as the downside to running a Kafka broker and letting
it write your files locally?

I'm new to kafka, so just exploring ideas here:

producer-side broker downsides:
1.  Heavier memory/processing footprint than just a producer

producer-side broker upsides:
1.  Eliminates the middle man: you essentially have peer-to-peer
operation between producers and consumers, with ZooKeeper (ZK) as the
coordinator.  For me, this is big, since I don't have to worry about
high availability (HA) for the brokers.
2.  Eliminates duplicating the data on disk at both the producer and the broker.
3.  Data has already been demultiplexed into its topics by the time it
is on disk at the producer/broker.  This means that I can purge data
based on per-topic policies.  (Our data arrives multiplexed and has to
be split into topics, and we also run out of storage during a network
outage.)
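That kind of per-topic purge policy can be sketched in a few lines.  A
hypothetical illustration only: the one-directory-per-topic layout and the
`retention_secs` map are assumptions for the sketch, not Kafka's actual
on-disk format or configuration.

```python
import os
import time

def purge_topics(data_dir, retention_secs):
    """Delete files older than each topic's retention window.

    Assumes data_dir holds one subdirectory per topic (hypothetical
    layout); retention_secs maps topic name -> max file age in seconds.
    """
    now = time.time()
    for topic, max_age in retention_secs.items():
        topic_dir = os.path.join(data_dir, topic)
        if not os.path.isdir(topic_dir):
            continue
        for name in os.listdir(topic_dir):
            path = os.path.join(topic_dir, name)
            # Purge any file whose last-modified time is past the window.
            if now - os.path.getmtime(path) > max_age:
                os.remove(path)
```

Because the data is already split by topic on disk, a policy like this can
age out one topic aggressively while keeping another for much longer.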

From the research I've been doing, this is the model proposed by the
0mq (ZeroMQ) folks, I think.  It's just that all of the wiring is
already written in Kafka.


On Thu, Apr 12, 2012 at 12:33 PM, Niek Sanders <niek.sanders@gmail.com> wrote:
> Dealing with network/broker outage on the producer side is also
> something that I've been trying to solve.
> Having a hook for the producer to dump to a local file would probably
> be the simplest solution.  In the event of a prolonged outage, this
> file could be replayed once availability is restored.
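That dump-and-replay hook might be sketched roughly as below.  All names
here are hypothetical; `send` stands in for whatever call actually hands a
message to the Kafka producer, and is assumed to raise on network/broker
failure.  Note this gives at-least-once delivery: a replay that fails
partway keeps the spool file, so duplicates are possible on retry.

```python
import os

class SpoolingSender:
    """Try to send each message; on failure, append it to a local spool
    file so it can be replayed once the broker is reachable again."""

    def __init__(self, send, spool_path):
        self.send = send          # callable assumed to raise on failure
        self.spool_path = spool_path

    def produce(self, message):
        try:
            self.send(message)
        except Exception:
            # Broker/network down: spool locally instead of blocking.
            with open(self.spool_path, "a") as f:
                f.write(message + "\n")

    def replay(self):
        """Re-send spooled messages after availability is restored."""
        if not os.path.exists(self.spool_path):
            return 0
        with open(self.spool_path) as f:
            lines = [ln.rstrip("\n") for ln in f]
        for msg in lines:
            self.send(msg)        # if this raises, the spool file is kept
        os.remove(self.spool_path)
        return len(lines)
```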
> The current approach I've been taking:
> 1) My bridge code between my data source and the Kafka producer writes
> everything to a local log file.  When the bridge starts up, it
> generates a unique 8-character alphanumeric string.  For each log
> entry it writes to the local file, it prefixes both the alphanumeric
> string and a log line number (0, 1, 2, 3, ...).  The data already
> arrives with timestamps.
> 2) In the event of a network outage, or Kafka being unable to keep up
> with the producer, I simply drop the Kafka messages.  I never allow my
> data source to be blocked waiting on the Kafka
> producer/broker.
> 3) For given time ranges, my consumers track all the alphanumeric
> identifiers that they consumed and the maximum complete sequence
> number that they have seen.
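The tagging in step 1 above might look like the following sketch (class
and function names are hypothetical, not from Niek's actual code):

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits

def new_run_id():
    """Unique 8-character alphanumeric string, generated once at startup."""
    return "".join(random.choice(ALPHABET) for _ in range(8))

class TaggingLogger:
    """Prefix every entry with (run id, sequence number) before writing
    it to the local log file; the data carries its own timestamps."""

    def __init__(self, fileobj):
        self.run_id = new_run_id()
        self.seq = 0
        self.fileobj = fileobj

    def write(self, entry):
        self.fileobj.write("%s %d %s\n" % (self.run_id, self.seq, entry))
        self.seq += 1
```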
> So I can manually go back to producers and replay any lost data
> (whether it was never sent because of a network outage or it was lost
> in a broker hardware failure).
> I basically go to the producer machine (which I track in the Kafka
> message body) and say: for time A to time B, I received data for these
> identifiers and max sequence numbers (najeh2wh, 12312), (ji3njdKL,
> 71).  Replay anything that I'm missing.
> I use random identifier strings because it saves me from having to
> persist the number of log lines my producer has generated.
> (Robustness against producer failure).
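The consumer-side bookkeeping described above, tracking the maximum
*complete* sequence number per identifier (the largest n such that every
entry 0..n has been seen), might be sketched as follows.  Names are
hypothetical; a run id that never delivers entry 0 simply does not appear
in the replay request.

```python
class ReplayTracker:
    """Per producer run id, track the maximum complete sequence number:
    the largest n such that entries 0..n have all been seen."""

    def __init__(self):
        self.complete = {}   # run id -> max complete sequence number
        self.pending = {}    # run id -> out-of-order sequence numbers

    def record(self, run_id, seq):
        nxt = self.complete.get(run_id, -1) + 1
        buf = self.pending.setdefault(run_id, set())
        if seq == nxt:
            nxt += 1
            while nxt in buf:    # drain any buffered out-of-order entries
                buf.discard(nxt)
                nxt += 1
            self.complete[run_id] = nxt - 1
        elif seq > nxt:
            buf.add(seq)         # arrived early; a gap still precedes it

    def replay_request(self):
        """(run id, max complete seq) pairs to hand back to the producer,
        meaning: replay anything beyond these."""
        return sorted(self.complete.items())
```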
> - Niek
> On Thu, Apr 12, 2012 at 7:12 AM, Edward Smith <esmith@stardotstar.org> wrote:
>> Jun/Eric,
>> [snip]
>>  However, we have a requirement to support HA.  If I stick with the
>> approach above, I have to worry about replicating/mirroring the
>> queues, which always gets sticky.  We have to handle the case where a
>> producer loses network connectivity, and so must be able to queue
>> locally at the producer, which I believe means either putting the
>> Kafka broker there or continuing to use some 'homebrew' local queue.
>> With brokers on the same node as producers, consumers only have to HA
>> the results of their processing, and I don't have to HA the queues.
>>  Any thoughts or feedback from the group is welcome.
>> Ed
