spark-user mailing list archives

From Gerard Maas <gerard.m...@gmail.com>
Subject Re: [StructuredStreaming] HDFSBackedStateStoreProvider is leaking .crc files.
Date Wed, 12 Jun 2019 11:22:44 GMT
Oops, I linked the wrong JIRA ticket (that other one is related). The correct one is:

https://issues.apache.org/jira/browse/SPARK-28025

On Wed, Jun 12, 2019 at 1:21 PM Gerard Maas <gerard.maas@gmail.com> wrote:

> Hi!
> I would like to socialize this issue we are currently facing:
> The Structured Streaming default CheckpointFileManager leaks .crc files by
> leaving them behind after users of this class (like
> HDFSBackedStateStoreProvider) apply their cleanup methods.
>
> This results in the unbounded creation of tiny files that eat away storage
> by the block and, in our case, degrades file system performance.
>
> We correlated the processedRowsPerSecond reported by the
> StreamingQueryProgress against a count of the .crc files in the storage
> directory (checkpoint + state store). The performance impact we observe is
> dramatic.
>
> We are running on Kubernetes, using GlusterFS as the shared storage
> provider.
> [image: out processedRowsPerSecond vs. files in storage_process.png]
> I have created a JIRA ticket with additional detail:
>
> https://issues.apache.org/jira/browse/SPARK-17475
>
> This is also related to an earlier discussion about the state store
> unbounded disk-size growth, which was left unresolved back then:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-State-Store-storage-behavior-for-the-Stream-Deduplication-function-td34883.html
>
> If there's any additional detail I should add/research, please let me know.
>
> kind regards, Gerard.
>
>
>
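
For anyone who wants to check whether their own checkpoint location shows the same growth, the file count described above can be approximated with a small script. This is only a sketch, not the measurement code used for the plot: the function name and the example path are made up for illustration, and it assumes the checkpoint and state store directories are reachable as a local or mounted path (as with a GlusterFS mount).

```python
import os


def count_crc_files(root):
    """Count files whose name ends in .crc anywhere under root."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        # Hadoop checksum side files are named like ".<file>.crc".
        total += sum(1 for name in filenames if name.endswith(".crc"))
    return total


if __name__ == "__main__":
    # Hypothetical mount point; replace with your own checkpoint location.
    checkpoint_dir = "/mnt/checkpoints/my-query"
    print(count_crc_files(checkpoint_dir))
```

Running this periodically and logging the result next to the processedRowsPerSecond value from the query progress should be enough to see whether the two move together.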
