kafka-users mailing list archives

From Eno Thereska <eno.there...@gmail.com>
Subject Re: Windowed aggregations memory requirements
Date Wed, 03 May 2017 19:06:32 GMT
This is a timely question and we've updated the documentation here on capacity planning and
sizing for Kafka Streams jobs: <http://docs.confluent.io/current/streams/sizing.html>.
Any feedback welcome. It has scenarios with windowed stores too.

Thanks
Eno
> On 3 May 2017, at 18:51, Garrett Barton <garrett.barton@gmail.com> wrote:
> 
> That depends on whether you're using event, processing, or ingestion time.
> 
> My understanding is that if you play through a record that is T-6, the only
> way that TimeWindows.of(TimeUnit.MINUTES.toMillis(1)).until(TimeUnit.MINUTES.toMillis(5))
> would actually process that record in your window is if you're using
> processing time.  Otherwise the record is skipped and no data is
> generated or calculated for that operation.  So, depending on what you're
> doing, you would not use any more memory than when consuming in
> real time.
> 
> On Wed, May 3, 2017 at 3:37 AM, João Peixoto <joao.hartimer@gmail.com>
> wrote:
> 
>> The base question I'm trying to answer is "how much memory does my instance
>> need".
>> 
>> Considering a use case where I want to keep a rolling average on a tumbling
>> window of 1 minute size allowing for late arrivals up to 5 minutes (lower
>> bound) we would have something like this:
>> 
>> TimeWindows.of(TimeUnit.MINUTES.toMillis(1)).until(
>> TimeUnit.MINUTES.toMillis(5))
>> 
>> The aggregate key size is 8 bytes, the average value is 8 bytes and for
>> de-duplication purposes we need to keep track of which messages we saw
>> already, so a list of keys averaging 10 entries.
>> 
>> If I understand correctly this means that each window will be on average 96
>> bytes in size.
>> 
>> A single topic generates 100 messages/minute, which aggregate into 10
>> independent windows.
>> 
>> At any given point in time the windowed aggregates require at least
>> 960 bytes of memory (10 windows * 96 bytes).
>> 
>> Here's the confusing part. Let's say I found an issue with my averaging
>> operation and I want to reprocess the last 10 hours' worth of messages.
>> 
>> 1. Windows will be regenerated, since most likely they were cleaned up
>> already
>> 2. The retention policy now has different semantics? If I had a late
>> arrival of 6 minutes, all of a sudden the reprocessing will account for
>> it, right? Since the window is now active due to recreation (assuming my app
>> is capable of processing all messages in under 5 minutes)
>> 3. I'll be keeping 10 windows * (60 * 10) minutes * 96 bytes for the first
>> 5 minutes, so my memory requirement is now 576,000 bytes? This is orders of
>> magnitude bigger.
>> 
>> I hope this gets my doubts across clearly; feel free to ask for more details.
>> And thanks in advance.
>> 
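
The arithmetic in the original question can be sketched in a few lines of Java. This is only a back-of-the-envelope check of the figures quoted above, not a statement of how Kafka Streams actually accounts for memory: the class and method names are hypothetical, the 8 bytes per de-duplication key is an assumption inferred from the 96-byte total, and store, cache and serialization overhead are ignored.

```java
public class WindowSizing {

    // Bytes for one window: aggregate key + running average + de-duplication key list.
    // dedupKeyBytes is an assumption (8 bytes each makes the quoted 96-byte total work out).
    static long bytesPerWindow(int keyBytes, int valueBytes, int dedupKeys, int dedupKeyBytes) {
        return keyBytes + valueBytes + (long) dedupKeys * dedupKeyBytes;
    }

    // Raw payload bytes for a number of windows per minute held over a number of minutes.
    static long totalBytes(long windowsPerMinute, long minutesHeld, long bytesPerWindow) {
        return windowsPerMinute * minutesHeld * bytesPerWindow;
    }

    public static void main(String[] args) {
        long perWindow = bytesPerWindow(8, 8, 10, 8);

        // Real-time case from the question: 10 windows for the current minute.
        long steadyState = totalBytes(10, 1, perWindow);

        // Worst case feared in the question: all 10 hours of windows live at once.
        long reprocessing = totalBytes(10, 60 * 10, perWindow);

        System.out.println("bytes per window:   " + perWindow);     // 96
        System.out.println("steady state bytes: " + steadyState);   // 960
        System.out.println("reprocessing bytes: " + reprocessing);  // 576000
    }
}
```

Whether the reprocessing figure is ever reached depends on the timestamp semantics discussed in the replies above; the sizing guide linked at the top of the thread is the better source for real capacity planning.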

