spark-user mailing list archives

From Gerard Maas
Subject [StructuredStreaming] HDFSBackedStateStoreProvider is leaking .crc files.
Date Wed, 12 Jun 2019 11:21:25 GMT
I would like to socialize this issue we are currently facing:
The Structured Streaming default CheckpointFileManager leaks .crc files by
leaving them behind after users of this class (like
HDFSBackedStateStoreProvider) apply their cleanup methods.

This results in the unbounded creation of tiny files that each consume a
full storage block and, in our case, degrade file system performance.
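
A minimal sketch of how leaked files could be listed, assuming the
checksum files follow the Hadoop ChecksumFileSystem naming convention
(".<data file name>.crc" next to the data file) and that the checkpoint
directory is reachable as a local or mounted path. The function name
find_orphaned_crc is hypothetical and is not part of Spark:

```python
import os


def find_orphaned_crc(root):
    """Return sorted paths of .crc files whose data file no longer exists."""
    orphans = []
    for dirpath, _dirnames, filenames in os.walk(root):
        present = set(filenames)
        for name in filenames:
            # Assumed convention: the checksum for "1.delta" is ".1.delta.crc".
            if name.startswith(".") and name.endswith(".crc") and len(name) > 5:
                data_name = name[1:-len(".crc")]
                if data_name not in present:
                    orphans.append(os.path.join(dirpath, name))
    return sorted(orphans)
```

Listing the candidates without deleting anything is deliberate: it lets the
result be checked against a running query before any cleanup is attempted.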

We correlated the processedRowsPerSecond reported by the
StreamingQueryProgress against a count of the .crc files in the storage
directory (checkpoint + state store). The performance impact we observe is

We are running on Kubernetes, using GlusterFS as the shared storage.
[image: out processedRowsPerSecond vs. files in storage_process.png]
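
A correlation of this kind could be sketched as follows. This is an
illustration under assumptions, not the tooling behind the chart: the
helper names count_crc_files and pearson are hypothetical, the storage
directory is assumed to be reachable as a local or mounted path, and the
processedRowsPerSecond samples are assumed to have been collected
separately from StreamingQueryProgress at the same moments as the file
counts.

```python
import math
import os


def count_crc_files(root):
    """Count files ending in .crc anywhere under root."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += sum(1 for name in filenames if name.endswith(".crc"))
    return total


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

Sampling count_crc_files at each progress update and passing the two
series to pearson yields a value near -1 when throughput falls as the
file count grows.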
I have created a JIRA ticket with additional detail:

This is also related to an earlier discussion about the state store
unbounded disk-size growth, which was left unresolved back then:

If there's any additional detail I should add/research, please let me know.

kind regards, Gerard.
