kafka-users mailing list archives

From "Matthias J. Sax" <matth...@confluent.io>
Subject Re: How does the /tmp/kafka-streams folder work?
Date Fri, 28 Dec 2018 15:54:29 GMT
> (only the last message per key is kept and the others are discarded)

Exactly: https://kafka.apache.org/documentation/#compaction
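
For completeness, here is a minimal sketch (topic name, partition count,
and replication factor are hypothetical) of creating a compacted topic
with the Java AdminClient:

  import java.util.Collections;
  import java.util.Properties;
  import org.apache.kafka.clients.admin.AdminClient;
  import org.apache.kafka.clients.admin.AdminClientConfig;
  import org.apache.kafka.clients.admin.NewTopic;
  import org.apache.kafka.common.config.TopicConfig;

  public class CreateCompactedTopic {
      public static void main(String[] args) throws Exception {
          Properties props = new Properties();
          props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
          try (AdminClient admin = AdminClient.create(props)) {
              // cleanup.policy=compact retains only the latest record per key
              NewTopic topic = new NewTopic("my-input-topic", 3, (short) 3)
                  .configs(Collections.singletonMap(
                      TopicConfig.CLEANUP_POLICY_CONFIG,
                      TopicConfig.CLEANUP_POLICY_COMPACT));
              admin.createTopics(Collections.singletonList(topic)).all().get();
          }
      }
  }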


-Matthias

On 12/28/18 4:23 PM, Peter Levart wrote:
> Thanks Matthias for clarifying the change log mechanics. So compacting
> the change log is not some kind of "compressing", but actually getting
> rid of information that is not needed for reconstructing the local store?
> Great. And Kafka brokers are already knowledgeable enough to make this
> happen? How does this work? Does the scheme use message keys to decide
> what to keep (only the last message per key is kept and the others are discarded)?
> 
> Regards, Peter
> 
> On 12/28/18 4:13 PM, Matthias J. Sax wrote:
>>>> Is it really necessary to keep the whole log topic of a particular
>>>> local store? Such a log will grow indefinitely. Replaying such a log
>>>> will take more and more time. Does Kafka Streams write just a changelog
>>>> to such a topic, or does it eventually write a snapshot of the current
>>>> store too?
>> Yes, it is necessary to keep the log; however, it won't grow
>> indefinitely because of log compaction. Also, replaying the log is
>> bounded by the number of distinct keys in the log once it is compacted.
>> Because Kafka Streams uses change-logging as its fault-tolerance
>> mechanism, it's not required to snapshot the local stores additionally
>> --- basically, the compacted changelog is the same thing as a snapshot.
>>
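>> A minimal sketch (topic and store names are hypothetical) of an
>> aggregation whose state is backed by a compacted changelog that Kafka
>> Streams creates on its own:
>>
>>   import org.apache.kafka.common.serialization.Serdes;
>>   import org.apache.kafka.streams.StreamsBuilder;
>>   import org.apache.kafka.streams.kstream.Consumed;
>>   import org.apache.kafka.streams.kstream.KTable;
>>   import org.apache.kafka.streams.kstream.Materialized;
>>
>>   StreamsBuilder builder = new StreamsBuilder();
>>   KTable<String, Long> counts = builder
>>       .stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
>>       .groupByKey()
>>       .count(Materialized.as("counts-store"));
>>   // on restart with an empty /tmp/kafka-streams, the store is restored
>>   // from "<application.id>-counts-store-changelog" (cleanup.policy=compact)
>>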
>> Does this make sense?
>>
>>>> If I configure 'num.standby.replicas' with a number > 0, will such
>>>> replicas keep their own synchronized local store on disk? In that case,
>>>> losing an active stream processor will cause fall-back to a standby
>>>> processor, and the log will only be replayed from the offset that has
>>>> not been applied yet to the local store of the stand-by processor,
>> Yes, that is the idea of standby replicas.
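>>
>> A minimal sketch (application id and bootstrap servers are
>> hypothetical) of enabling one standby per store:
>>
>>   import java.util.Properties;
>>   import org.apache.kafka.streams.StreamsConfig;
>>
>>   Properties props = new Properties();
>>   props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
>>   props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
>>   // maintain one warm replica of each local store on another instance;
>>   // fail-over then only replays the changelog tail, not the whole log
>>   props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);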
>>
>>>> meaning that
>>>> we don't need to keep the whole log. But we need to be sure not to
>>>> lose all local stores of all stand-by and active processor(s) then.
>> No. Standby replicas are independent of the changelog we need to keep
>> (cf. above).
>>
>>>> In case a
>>>> stand-by processor that has been chosen to be promoted to the active
>>>> role does not have a local store any more, the whole log will be
>>>> needed again, right?
>> Yes, the whole changelog will be replayed -- note again that the size
>> of the changelog is proportional to the size of the state due to log
>> compaction.
>>
>>>> Suppose that the store that is needed is a WindowStore which only keeps
>>>> data for a limited number of past windows. Would it be possible to
>>>> reconstruct the "usable" part of such a store from a limited number of
>>>> past log records, so that the full log would not be necessary?
>> It works the same way as for non-windowed stores: the changelog keeps
>> the latest update for each window. Older windows that pass the
>> retention time are not maintained any longer and are deleted. Within
>> the retention time, however, the compacted changelog keeps the latest
>> update per window, and thus the full state can be recreated without any
>> data loss.
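>>
>> A minimal sketch (topic, store name, window size, and retention are
>> hypothetical; reusing the builder from the sketch above) of a windowed
>> count with an explicit retention:
>>
>>   import java.time.Duration;
>>   import org.apache.kafka.common.utils.Bytes;
>>   import org.apache.kafka.streams.kstream.*;
>>   import org.apache.kafka.streams.state.WindowStore;
>>
>>   KTable<Windowed<String>, Long> windowedCounts = builder
>>       .stream("events", Consumed.with(Serdes.String(), Serdes.String()))
>>       .groupByKey()
>>       .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
>>       .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("windowed-counts")
>>           .withRetention(Duration.ofDays(1)));
>>   // the changelog of a windowed store is created with
>>   // cleanup.policy=compact,delete, so expired windows age out of the
>>   // topic while the latest update per live window is kept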
>>
>>
>> -Matthias
>>
>> On 12/28/18 1:42 PM, Peter Levart wrote:
>>> Hi Matthias,
>>>
>>> Just a couple of questions about that...
>>>
>>> On 12/27/18 9:57 PM, Matthias J. Sax wrote:
>>>> All data is backed up in the Kafka cluster. Data that is stored
>>>> locally is basically a cache, and Kafka Streams will recreate the
>>>> local data if you lose it.
>>>>
>>>> Thus, I am not sure how the KTable data could be stale. One possibility
>>>> might be a misconfiguration: I assume that you read the topic directly
>>>> as a table (i.e., builder.table("topic")). If you do this, the input
>>>> topic must be configured with log compaction --- if it is configured
>>>> with retention instead, you might lose data from the input topic, and
>>>> if you also lose the local cache, Kafka Streams cannot recreate the
>>>> local state because it was deleted from the topic (log compaction
>>>> guards the input topic against this data loss).
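>>>>
>>>> A minimal sketch (topic name is hypothetical) of reading a topic
>>>> directly as a table:
>>>>
>>>>   import org.apache.kafka.streams.StreamsBuilder;
>>>>   import org.apache.kafka.streams.kstream.KTable;
>>>>
>>>>   StreamsBuilder builder = new StreamsBuilder();
>>>>   // "user-profiles" must be a compacted topic; with plain retention,
>>>>   // old records may be deleted and the table cannot be fully
>>>>   // restored once the local state under /tmp/kafka-streams is gone
>>>>   KTable<String, String> users = builder.table("user-profiles");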
>>> Is it really necessary to keep the whole log topic of a particular local
>>> store? Such a log will grow indefinitely. Replaying such a log will take
>>> more and more time. Does Kafka Streams write just a changelog to such a
>>> topic, or does it eventually write a snapshot of the current store too?
>>>
>>> If I configure 'num.standby.replicas' with a number > 0, will such
>>> replicas keep their own synchronized local store on disk? In that case,
>>> losing an active stream processor will cause fall-back to a standby
>>> processor, and the log will only be replayed from the offset that has
>>> not been applied yet to the local store of the stand-by processor,
>>> meaning that we don't need to keep the whole log. But we need to be
>>> sure not to lose all local stores of all stand-by and active
>>> processor(s) then. In case a stand-by processor that has been chosen to
>>> be promoted to the active role does not have a local store any more,
>>> the whole log will be needed again, right?
>>>
>>> Suppose that the store that is needed is a WindowStore which only keeps
>>> data for a limited number of past windows. Would it be possible to
>>> reconstruct the "usable" part of such a store from a limited number of
>>> past log records, so that the full log would not be necessary?
>>>
>>> Regards, Peter
>>>
>>>>
>>>> -Matthias
>>>>
>>>>
>>>> On 12/24/18 12:22 PM, Edmondo Porcu wrote:
>>>>> Hello Kafka users,
>>>>>
>>>>> we are running a Kafka Streams application as fully stateless,
>>>>> meaning that we are not persisting /tmp/kafka-streams on a durable
>>>>> volume but are rather losing it at each restart. This application is
>>>>> performing a KTable-KTable join of data coming from Kafka Connect,
>>>>> and sometimes we want to force the output to tick, so we update
>>>>> records of the right table from the database, but we see that the
>>>>> left table is "stale".
>>>>>
>>>>> Is it possible that, because of the reboots, the application loses
>>>>> some messages? How is the state reconstructed when /tmp/kafka-streams
>>>>> is not available? Is the state saved in an intermediate topic?
>>>>>
>>>>> Thanks,
>>>>> Edmondo
>>>>>
> 

