samza-dev mailing list archives

From Jay Kreps <>
Subject window-based retention for state management
Date Tue, 10 Sep 2013 16:12:45 GMT
Here is another half-baked idea. :-)

The key-value storage engine maintains values until they are overwritten or
deleted. Here are two very common use cases that could potentially be
optimized significantly by having a storage engine that allowed maintaining
data for a window of time:
1. Stream join. This is the case where you have two nearly aligned streams
like clicks and impressions and you want to join them. You want to store
clicks until a corresponding impression arrives or impressions until a
corresponding click arrives. Since not every impression has a click you
want to eventually time these out and discard them.
2. Buffered sort. A fast key-value storage engine that supported range
queries could be used to implement a buffered sort.
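
To make the stream-join case concrete, here is a minimal in-memory sketch of joining clicks and impressions with a timeout. It is only an illustration, not Samza code: the class name `WindowedJoin`, the `process` method, and the record kinds are hypothetical, timestamps are passed in by the caller, and they are assumed to be non-decreasing.

```python
from collections import OrderedDict


class WindowedJoin:
    """Joins clicks and impressions on a shared key within a time window."""

    def __init__(self, window_ms):
        self.window_ms = window_ms
        # Insertion-ordered, so the oldest pending entries sit at the front.
        self.pending = {"click": OrderedDict(), "impression": OrderedDict()}

    def _expire(self, now_ms):
        # Discard pending entries that have waited for the full window.
        for buffered in self.pending.values():
            while buffered:
                key, (ts, _) = next(iter(buffered.items()))
                if now_ms - ts < self.window_ms:
                    break
                buffered.popitem(last=False)

    def process(self, kind, key, value, now_ms):
        """Returns (click, impression) on a match, else buffers and returns None."""
        self._expire(now_ms)
        other = "impression" if kind == "click" else "click"
        match = self.pending[other].pop(key, None)
        if match is not None:
            _, other_value = match
            return (value, other_value) if kind == "click" else (other_value, value)
        # Re-insert at the back so the ordering by arrival time stays valid.
        self.pending[kind].pop(key, None)
        self.pending[kind][key] = (now_ms, value)
        return None
```

In this sketch an unmatched impression simply disappears once it is older than `window_ms`, which is the "time these out and discard them" behavior described above. A real store would hold the pending entries on disk rather than in a dictionary.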

The implementation in something like LevelDB is actually quite simple: in
addition to garbage collection, you just discard data segments once they
are older than the retention period.
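
A rough sketch of the segment-discard idea, assuming writes are grouped into fixed-length time segments and a whole segment is dropped once it falls outside the retention period. This is not how LevelDB is actually structured internally; `SegmentedStore`, `segment_ms`, and `retention_ms` are hypothetical names used only for illustration.

```python
class SegmentedStore:
    """Key-value store that groups writes into time segments and drops old ones."""

    def __init__(self, segment_ms, retention_ms):
        self.segment_ms = segment_ms
        self.retention_ms = retention_ms
        self.segments = {}  # segment start time -> {key: value}

    def put(self, key, value, now_ms):
        # Each write lands in the segment covering its timestamp.
        start = now_ms - now_ms % self.segment_ms
        self.segments.setdefault(start, {})[key] = value

    def get(self, key):
        # Search newest segment first, so newer writes shadow older ones.
        for start in sorted(self.segments, reverse=True):
            if key in self.segments[start]:
                return self.segments[start][key]
        return None

    def expire(self, now_ms):
        """Drops every segment that ended at or before the retention cutoff."""
        cutoff = now_ms - self.retention_ms
        expired = [s for s in self.segments if s + self.segment_ms <= cutoff]
        for s in expired:
            del self.segments[s]
        return len(expired)
```

Expiry here never inspects individual keys: it deletes whole segments, which is what makes the approach cheap compared with per-key deletes.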

This collection mechanism works well with the changelogs too: Kafka
already supports time-based retention.

Time-based retention gets rid of the need for garbage collection in the
log and also gets rid of the need to log deletes.

You would ideally want the retention to be deterministic on the storage
engine, and the Kafka topic to retain more data than the storage engine, so
that a complete restore is always possible.
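
One way to read the restore requirement, sketched under assumptions: the changelog is modeled as a list of `(timestamp, key, value)` records, the store's cutoff is computed from record timestamps rather than from when each node happens to run cleanup, and the function names are hypothetical.

```python
def can_fully_restore(store_retention_ms, log_retention_ms):
    """The log must keep at least as much history as the store does."""
    return log_retention_ms >= store_retention_ms


def restore(changelog, store_retention_ms, now_ms):
    """Rebuilds store contents from changelog records still inside the window."""
    cutoff = now_ms - store_retention_ms
    state = {}
    for ts, key, value in changelog:
        # Records older than the store's window would already have been
        # discarded by the store, so they are skipped on replay too.
        if ts >= cutoff:
            state[key] = value
    return state
```

Because the cutoff depends only on timestamps, replaying the same changelog at the same `now_ms` always produces the same state, and a log that retains more than the store guarantees that nothing inside the window is missing.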

