kafka-users mailing list archives

From Ole Koenecke <o.koene...@kasasi.de>
Subject stale RocksDB store files are not getting cleaned up
Date Wed, 10 Apr 2019 14:47:32 GMT
Hi all,
I have a problem with my kafka-streams (2.1.1) application. Sorry for being vague, but I couldn't
find more information than the following:
Most of the time my services run just fine, but sometimes (I cannot put my finger
on a precise trigger) the .sst files of more or less random services stop getting cleaned
up. The number just keeps growing until I restart the specific service or reach the
file limit of my server. It seems that services using more state stores are affected
more often.
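
A minimal sketch of how this growth could be tracked per store, assuming the RocksDB stores keep their .sst files somewhere below the Kafka Streams state directory. The class name `SstFileCounter` is hypothetical, and the fallback path `/tmp/kafka-streams` is only an assumed `state.dir` value; pass the real one as the first argument.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Kafka Streams or RocksDB.
class SstFileCounter {

    // Counts *.sst files below stateDir, grouped by the directory that contains them.
    static Map<Path, Long> countSstFiles(Path stateDir) {
        try (Stream<Path> paths = Files.walk(stateDir)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(".sst"))
                .collect(Collectors.groupingBy(
                    Path::getParent, TreeMap::new, Collectors.counting()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed location; replace with the application's configured state.dir.
        Path stateDir = Paths.get(args.length > 0 ? args[0] : "/tmp/kafka-streams");
        countSstFiles(stateDir).forEach((dir, n) -> System.out.println(n + "\t" + dir));
    }
}
```

Running this periodically on an affected and an unaffected instance would show which store directories keep growing. Files that RocksDB deletes while the walk is in progress can make the traversal throw, so a retry may be needed on a busy store.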

What I could observe is that there is always "an event" before this happens. Yesterday,
for example, we had to shut down one of our brokers and the consumers logged:
Received invalid metadata error in produce request on partition my-store-changelog-16 due
to org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader
for that topic-partition.. Going to request metadata update now

Although 6 instances of that service logged this message, only 2 of them started piling up
.sst files. All 6 kept working.

Some days ago the affected services logged the following messages before the file descriptor
count started rising:
Failed to commit stream task 0_17 since it got migrated to another thread already. Closing
it as zombie before triggering a new rebalance.
Detected task 0_17 that got migrated to another thread. This implies that this thread missed
a rebalance and dropped out of the consumer group. Will try to rejoin the consumer group.
Below is the detailed description of the task: …
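
To line up such log events with the rising file descriptor count, the count could be sampled from inside the JVM. A minimal sketch, assuming a Unix-like host and a JDK that exposes `com.sun.management.UnixOperatingSystemMXBean`; the class name `FdProbe` is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

// Hypothetical helper for logging the JVM's open file descriptor count.
class FdProbe {

    // Returns the number of file descriptors currently open in this JVM,
    // or -1 if the platform bean does not expose that value.
    static long openFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            return ((UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
        }
        return -1L;
    }

    public static void main(String[] args) {
        System.out.println("open file descriptors: " + openFileDescriptors());
    }
}
```

Logging this value next to rebalance and metadata-update messages would show whether the rise starts right after such an event.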

I already checked https://github.com/facebook/rocksdb/wiki/Delete-Stale-Files and looked
for leaking iterators in our code. I couldn't find any, and if we had a resource leak the problem
would occur all the time, I guess? I found this old commit https://github.com/apache/kafka/commit/2b431b551252a65113cb720b102a2f3e8b301099
and thought the issue it fixes looked a lot like mine. Could there be a rare case of a resource/iterator
leak when a producer has to update its metadata?
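
One way to double-check application code for leaked iterators is to wrap every store iterator in a counting wrapper and watch the number of open ones. The following is a generic sketch in plain Java, not Kafka Streams API: `CountingIterator` is a hypothetical class, and the `resource` argument stands in for whatever closeable iterator the store returns.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical wrapper: tracks iterators that were opened but not yet closed.
final class CountingIterator<T> implements Iterator<T>, AutoCloseable {
    // Number of wrappers created and not yet closed, across all threads.
    static final AtomicLong OPEN = new AtomicLong();

    private final Iterator<T> delegate;
    private final AutoCloseable resource;
    private boolean closed;

    CountingIterator(Iterator<T> delegate, AutoCloseable resource) {
        this.delegate = delegate;
        this.resource = resource;
        OPEN.incrementAndGet();
    }

    @Override public boolean hasNext() { return delegate.hasNext(); }
    @Override public T next() { return delegate.next(); }

    // Idempotent: a second close() neither decrements again nor re-closes the resource.
    @Override
    public void close() throws Exception {
        if (!closed) {
            closed = true;
            OPEN.decrementAndGet();
            resource.close();
        }
    }

    public static void main(String[] args) throws Exception {
        Iterator<String> plain = Arrays.asList("a", "b").iterator();
        // try-with-resources closes the iterator even if the loop body throws.
        try (CountingIterator<String> it = new CountingIterator<>(plain, () -> {})) {
            while (it.hasNext()) {
                System.out.println(it.next());
            }
        }
        System.out.println("open iterators: " + OPEN.get());
    }
}
```

If `OPEN` stays above zero while the application is idle, some code path is not closing its iterator. If it returns to zero and the .sst files still pile up, the leak is more likely outside the application code.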

I hope someone has an idea where I could start looking.
