kafka-users mailing list archives

From Ara Ebrahimi <ara.ebrah...@argyledata.com>
Subject Re: micro-batching in kafka streams
Date Mon, 26 Sep 2016 17:40:22 GMT
Hi,

So, here’s the situation:

- for classic batching of writes to external systems, right now I simply hack it. This specific
case is writing records to an Accumulo database, and I simply use the batch writer to batch
writes, and it flushes every second or so. I’ve added a shutdown hook to the JVM to flush
upon graceful exit too. This is good enough for me, but obviously it’s not perfect. I wish
Kafka Streams had some sort of a trigger (based on x number of records processed, or y window
of time passed). Which brings me to the next use case.

- I have some logic for calculating hourly statistics. So I’m dealing with Windowed data
already. These stats then need to be written to an external database for use by user facing
systems. Obviously I need to write the final result for each hourly window after we’re past
that window of time (or I can write as often as it gets updated but the problem is that the
external database is not as fast as Kafka). I do understand that I need to take into account
the fact that events may arrive out of order and there may be some records arriving a little
bit after I’ve considered the previous window over and have moved to the next one. I’d
like to have some sort of an hourly trigger (not just pure x milliseconds trigger, but also
support for cron style timing) and then also have the option to update the stats I’ve already
written for a window a set amount of time after the trigger has fired so that I can deal
with events which arrive after the write for that window. And then there’s a cut-off point
after which updating the stats for a very old window is just not worth it. Something like
this DSL:

kstream.trigger(
    /* when to trigger */ Cron.of("0 * * * *"),
    /* update every hour afterwards */ Hours.toMillis(1),
    /* discard changes older than this */ Hours.toMillis(24),
    /* lambda */ (windowStartTime, windowedKey, record) -> { /* write */ }
);
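
As a rough illustration of what the three arguments of such a trigger would have to decide, here is a minimal sketch in plain Java. It is not a Kafka Streams API: WindowWritePolicy, Action, NEVER_WRITTEN and decide() are hypothetical names, and the sketch assumes fixed-size windows aligned to the epoch, with the current processing time passed in by the caller.

```java
import java.time.Duration;

/** Hypothetical helper (not part of Kafka Streams): decides what to do with one window. */
final class WindowWritePolicy {
    enum Action { WAIT, WRITE, UPDATE, DISCARD }

    /** Marker for "this window has not been written yet" (an assumption of this sketch). */
    static final long NEVER_WRITTEN = -1L;

    private final long windowSizeMs;
    private final long updateEveryMs;
    private final long discardAfterMs;

    WindowWritePolicy(Duration windowSize, Duration updateEvery, Duration discardAfter) {
        this.windowSizeMs = windowSize.toMillis();
        this.updateEveryMs = updateEvery.toMillis();
        this.discardAfterMs = discardAfter.toMillis();
    }

    /** Start of the window that contains the given event (source) timestamp. */
    long windowStart(long eventTimeMs) {
        return eventTimeMs - Math.floorMod(eventTimeMs, windowSizeMs);
    }

    /**
     * Decide the action for a window at processing time nowMs.
     * lastWriteMs is the processing time of the previous write, or NEVER_WRITTEN.
     */
    Action decide(long windowStartMs, long nowMs, long lastWriteMs) {
        long windowEndMs = windowStartMs + windowSizeMs;
        if (nowMs < windowEndMs) {
            return Action.WAIT;                       // window is still open
        }
        if (nowMs - windowEndMs > discardAfterMs) {
            return Action.DISCARD;                    // past the cut-off point
        }
        if (lastWriteMs < 0) {
            return Action.WRITE;                      // first write after the window closed
        }
        // Already written: only rewrite once the update interval has elapsed.
        return (nowMs - lastWriteMs >= updateEveryMs) ? Action.UPDATE : Action.WAIT;
    }
}
```

With a one-hour window, a one-hour update interval and a 24-hour cut-off, a late record would lead to at most one rewrite per hour for its window, and would be dropped once the window has been closed for more than a day.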

The tricky part is reconciling event source time and event processing time. Clearly this trigger
runs in event processing time, whereas the data is most probably in event source time.

Something like that :)
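
For the first use case (flush after x records or after y time has passed), a minimal sketch of such a buffer in plain Java could look like the following. MicroBatcher and its methods are hypothetical names and not part of Kafka Streams or Accumulo; the sink stands for whatever performs the external write, and the caller is assumed to invoke tick() periodically, for example from a timer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical count-or-time buffer (not a Kafka Streams API). */
final class MicroBatcher<T> {
    private final int maxRecords;
    private final long maxAgeMs;
    private final Consumer<List<T>> sink;
    private final List<T> buffer = new ArrayList<>();
    private long oldestMs = -1L;   // arrival time of the oldest buffered record

    MicroBatcher(int maxRecords, long maxAgeMs, Consumer<List<T>> sink) {
        this.maxRecords = maxRecords;
        this.maxAgeMs = maxAgeMs;
        this.sink = sink;
    }

    /** Buffer a record; flush immediately once maxRecords records are buffered. */
    synchronized void add(T record, long nowMs) {
        if (buffer.isEmpty()) {
            oldestMs = nowMs;
        }
        buffer.add(record);
        if (buffer.size() >= maxRecords) {
            flush();
        }
    }

    /** Time-based trigger: flush if the oldest buffered record is at least maxAgeMs old. */
    synchronized void tick(long nowMs) {
        if (!buffer.isEmpty() && nowMs - oldestMs >= maxAgeMs) {
            flush();
        }
    }

    /** Hand the buffered records to the sink as one batch; no-op when empty. */
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer));  // copy so the sink owns its batch
        buffer.clear();
    }
}
```

The graceful-exit flush mentioned above could then be registered with Runtime.getRuntime().addShutdownHook(new Thread(batcher::flush)), where batcher is the MicroBatcher instance. Records buffered at the time of a hard crash would still be lost, so this is no more "perfect" than the hack it replaces.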

Ara.

> On Sep 26, 2016, at 1:59 AM, Michael Noll <michael@confluent.io> wrote:
>
> Ara,
>
> may I ask why you need to use micro-batching in the first place?
>
> Reason why I am asking: Typically, when people talk about micro-batching,
> they refer to the way some originally batch-based stream processing
> tools "bolt on" real-time processing by making their batch sizes really
> small.  Here, micro-batching belongs to the realm of the inner workings of
> the stream processing tool.
>
> Orthogonally to that, you have features/operations such as windowing,
> triggers, etc. that -- unlike micro-batching -- allow you as the user of
> the stream processing tool to define which exact computation logic you
> need.  Whether, say, windowing is computed via
> micro-batching behind the scenes should (at least in an ideal world) be of
> no concern to the user.
>
> -Michael
>
>
>
>
>
> On Mon, Sep 5, 2016 at 9:10 PM, Ara Ebrahimi <ara.ebrahimi@argyledata.com>
> wrote:
>
>> Hi,
>>
>> What’s the best way to do micro-batching in Kafka Streams? Any plans for a
>> built-in mechanism? Perhaps StateStore could act as the buffer? What
>> exactly are ProcessorContext.schedule()/punctuate() for? They don’t seem
>> to be used anywhere?
>>
>> http://hortonworks.com/blog/apache-storm-design-pattern-micro-batching/
>>
>> Ara.
>>
>>
>>
>> ________________________________
>>
>> This message is for the designated recipient only and may contain
>> privileged, proprietary, or otherwise confidential information. If you have
>> received it in error, please notify the sender immediately and delete the
>> original. Any other use of the e-mail by you is prohibited. Thank you in
>> advance for your cooperation.
>>
>> ________________________________
>>
