kafka-users mailing list archives

From "Matthias J. Sax" <matth...@confluent.io>
Subject Re: How does the /tmp/kafka-streams folder work?
Date Fri, 28 Dec 2018 15:13:15 GMT
>> Is it really necessary to keep the whole log topic of a particular local
>> store? Such a log will grow indefinitely. Replaying such a log will take
>> more and more time. Does Kafka Streams write just a changelog to such a
>> topic, or does it eventually write a snapshot of the current store too?

Yes, it is necessary to keep the log; however, it won't grow
indefinitely, because of log compaction. Also, replaying the log is
bounded by the number of keys in the log once it is compacted. Because
Kafka Streams uses change-logging as its fault-tolerance mechanism, it's
not required to snapshot the local stores additionally: basically, the
compacted changelog is the same thing as a snapshot.

Does this make sense?
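
For illustration, a minimal sketch (topic and store names are
hypothetical) of an aggregation whose local store is backed by an
internal changelog topic; Kafka Streams creates that topic with
cleanup.policy=compact, so its size stays proportional to the number
of distinct keys:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Materialized;

    public class ChangelogSketch {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            // Count records per key. The store "counts-store" is backed by
            // an internal, compacted changelog topic named
            // "<application.id>-counts-store-changelog"; the compacted log
            // serves as the snapshot, so none is written separately.
            builder.<String, String>stream("input-topic")
                   .groupByKey()
                   .count(Materialized.as("counts-store"));
            System.out.println(builder.build().describe());
        }
    }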

>> If I configure 'num.standby.replicas' with a number > 0, will such
>> replicas keep their own synchronized local stores on disk? In that case,
>> losing an active stream processor will fall back to a standby processor,
>> and the log will only be replayed from the offset that has not yet been
>> applied to the local store of the standby processor,

Yes, that is the idea of standby replicas.
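
To enable them, a minimal configuration sketch (application id and
bootstrap servers are placeholders):

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class StandbyConfigSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            // Keep one warm copy of each local store on another instance;
            // on failover, only the changelog tail that the standby has not
            // yet applied needs to be replayed.
            props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        }
    }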

>> meaning that
>> we don't need to keep the whole log. But we need to be sure not to lose
>> all local stores of all standby and active processor(s) then.

No. Standby replicas are independent of the changelog we need to keep
(cf. above).

>> In case a
>> standby processor that has been chosen to be promoted to the active role
>> no longer has a local store, the whole log will be needed again,
>> right?

Yes, the whole changelog will be replayed. Note again that the size
of the changelog is proportional to the size of the state, due to log
compaction.
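
If you want to observe such a replay, you can register a restore
listener before starting the application. A minimal sketch, assuming
string serdes and hypothetical topic/store names:

    import java.util.Properties;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.processor.StateRestoreListener;

    public class RestoreSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
                      Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
                      Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // The input topic must be compacted so the table can be fully
            // recreated after local state loss (cf. the discussion below).
            builder.table("input-topic", Materialized.as("table-store"));

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.setGlobalStateRestoreListener(new StateRestoreListener() {
                @Override
                public void onRestoreStart(TopicPartition tp, String store,
                                           long startOffset, long endOffset) {
                    // (endOffset - startOffset) records will be replayed;
                    // with compaction this is bounded by the number of keys.
                    System.out.println("restoring " + store + ": "
                            + (endOffset - startOffset) + " records");
                }
                @Override
                public void onBatchRestored(TopicPartition tp, String store,
                                            long batchEndOffset, long numRestored) { }
                @Override
                public void onRestoreEnd(TopicPartition tp, String store,
                                         long totalRestored) {
                    System.out.println("restored " + store);
                }
            });
            streams.start();
        }
    }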

>> Suppose that the store that is needed is a WindowStore, which only keeps
>> data for a limited number of past windows. Would it be possible to
>> reconstruct the "usable" part of such a store from the limited number of
>> past log records, so that the full log would not be necessary?

It works the same as for non-windowed stores. The changelog keeps the
latest update for each window. Older windows that pass the retention
time are no longer maintained and are deleted. However, the compacted
changelog topic keeps the latest update per window, and thus the full
state can be recreated without any data loss.
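
As a sketch (names are hypothetical; Duration-based API as of Kafka 2.1),
withRetention() bounds how long old windows, and hence their changelog
records, are kept:

    import java.time.Duration;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.TimeWindows;
    import org.apache.kafka.streams.state.WindowStore;

    public class WindowRetentionSketch {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            // Count per 5-minute window (1 minute grace for late records).
            // Windows older than the 1-hour retention are dropped from the
            // local store and expire from the changelog, so a restore only
            // replays windows that are still retained.
            builder.<String, String>stream("events")
                   .groupByKey()
                   .windowedBy(TimeWindows.of(Duration.ofMinutes(5))
                                          .grace(Duration.ofMinutes(1)))
                   .count(Materialized
                           .<String, Long, WindowStore<Bytes, byte[]>>as("window-counts")
                           .withRetention(Duration.ofHours(1)));
        }
    }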


-Matthias

On 12/28/18 1:42 PM, Peter Levart wrote:
> Hi Matthias,
> 
> Just a couple of questions about that...
> 
> On 12/27/18 9:57 PM, Matthias J. Sax wrote:
>> All data is backed up in the Kafka cluster. Data that is stored locally
>> is basically a cache, and Kafka Streams will recreate the local data if
>> you lose it.
>>
>> Thus, I am not sure how the KTable data could be stale. One possibility
>> might be a misconfiguration: I assume that you read the topic directly
>> as a table (i.e., builder.table("topic")). If you do this, the input
>> topic that is used must be configured with log compaction: if it is
>> configured with retention instead, you might lose data from the input
>> topic, and if you also lose the local cache, Kafka Streams cannot
>> recreate the local state because it was deleted from the topic (log
>> compaction guards the input topic against this data loss).
> 
> Is it really necessary to keep the whole log topic of a particular local
> store? Such a log will grow indefinitely. Replaying such a log will take
> more and more time. Does Kafka Streams write just a changelog to such a
> topic, or does it eventually write a snapshot of the current store too?
> 
> If I configure 'num.standby.replicas' with a number > 0, will such
> replicas keep their own synchronized local stores on disk? In that case,
> losing an active stream processor will fall back to a standby processor,
> and the log will only be replayed from the offset that has not yet been
> applied to the local store of the standby processor, meaning that
> we don't need to keep the whole log. But we need to be sure not to lose
> all local stores of all standby and active processor(s) then. In case a
> standby processor that has been chosen to be promoted to the active role
> no longer has a local store, the whole log will be needed again,
> right?
> 
> Suppose that the store that is needed is a WindowStore, which only keeps
> data for a limited number of past windows. Would it be possible to
> reconstruct the "usable" part of such a store from the limited number of
> past log records, so that the full log would not be necessary?
> 
> Regards, Peter
> 
>>
>>
>> -Matthias
>>
>>
>> On 12/24/18 12:22 PM, Edmondo Porcu wrote:
>>> Hello Kafka users,
>>>
>>> we are running a Kafka Streams application as fully stateless, meaning
>>> that we are not persisting /tmp/kafka-streams on a durable volume but
>>> are rather losing it at each restart. This application is performing a
>>> KTable-KTable join of data coming from Kafka Connect, and sometimes we
>>> want to force the output to tick, so we update records in the right
>>> table from the database; but we see that the left table is "stale".
>>>
>>> Is it possible that, because of reboots, the application loses some
>>> messages? How is the state reconstructed when /tmp/kafka-streams is
>>> not available? Is the state saved in an intermediate topic?
>>>
>>> Thanks,
>>> Edmondo
>>>
> 

