samza-dev mailing list archives

From Vladimir Lebedev <>
Subject changelog compaction problem
Date Wed, 29 Jul 2015 15:30:34 GMT

I have a problem: the changelog of one of my Samza jobs grows indefinitely.

The job is quite simple: it reads messages from the input Kafka topic 
and either creates or updates a key in a task-local Samza store. Once a 
minute the window method kicks in; it iterates over all keys in the 
store and deletes some of them, selecting on the contents of their values.
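In code the pattern looks roughly like this; a minimal self-contained sketch in which a plain HashMap stands in for the task-local Samza store, and the selection rule is a hypothetical placeholder (none of the names here come from the actual job):

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Sketch of the job's logic: process() upserts a key per message,
// and a once-a-minute window() sweeps the whole store, deleting
// entries based on the contents of their value.
public class SweepSketch {
    private final Map<String, String> store = new HashMap<>();

    // Called for each incoming message: create or update the key.
    public void process(String key, String value) {
        store.put(key, value);
    }

    // Called once a minute: iterate over all keys, delete by value.
    public void window() {
        Iterator<Map.Entry<String, String>> it = store.entrySet().iterator();
        while (it.hasNext()) {
            if (shouldDelete(it.next().getValue())) {
                it.remove();
            }
        }
    }

    // Hypothetical selection rule; the real job inspects the value.
    private boolean shouldDelete(String value) {
        return value.startsWith("done");
    }

    public int size() { return store.size(); }
}
```

In the real job the deletes go through the Samza store API, so each one should also produce a delete record in the changelog topic.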

The message rate in the input topic is about 3000 messages per second. 
The input topic has 48 partitions. The average number of keys kept in 
the store is more or less stable and does not exceed 10000 keys per task. 
The average value size is 50 bytes. So I expected the total size of all 
segments in the Kafka data directory for the job's changelog topic not 
to exceed 10000*50*48 ~= 24 MB. In fact it is more than 2.5 GB (after 
6 days running from scratch) and it keeps growing.
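The expected bound is just the arithmetic above: keys per task, times bytes per value, times the number of tasks (one per input partition):

```java
public class SizeEstimate {
    // Expected steady-state footprint of a fully compacted changelog:
    // one live record per key, per task.
    public static long expectedBytes() {
        long keysPerTask = 10_000; // stable upper bound per task
        long valueBytes  = 50;     // average value size
        long tasks       = 48;     // one task per input partition
        return keysPerTask * valueBytes * tasks; // 24,000,000 bytes ~= 24 MB
    }
}
```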

I tried changing the default segment size for the changelog topic in 
Kafka, and it helped a little: instead of 500 MB segments I now have 
50 MB segments. But it did not cure the indefinite data growth.
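For reference, the segment-size change was applied with the stock Kafka topic tooling, along these lines (the topic name and ZooKeeper address are placeholders, not the actual values):

```
# Shrink segments for the changelog topic so older segments roll over
# (and become eligible for cleanup) sooner. Topic name is hypothetical.
kafka-topics.sh --zookeeper localhost:2181 --alter \
  --topic samza-job-changelog \
  --config segment.bytes=52428800
```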

Moreover, if I stop the job and start it again, it cannot restart: it 
breaks right after reading all records from the changelog topic.

Has anybody had a similar problem? How could it be resolved?

Best regards,

Vladimir Lebedev
