spark-issues mailing list archives

From "Jungtaek Lim (Jira)" <>
Subject [jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.
Date Wed, 09 Sep 2020 02:29:00 GMT


Jungtaek Lim commented on SPARK-24295:


Thanks for sharing the workaround. I've proposed applying a TTL to FileStreamSink output, which
does something similar to your workaround but purges on every compact batch. Unfortunately, it
hasn't drawn enough interest from committers.

SPARK-27188
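
To make the TTL idea concrete, here is a minimal sketch in plain Python, not Spark code. `SinkFileEntry`, `compact_with_ttl`, and the millisecond timestamp field are hypothetical names chosen for illustration; the actual proposal in SPARK-27188 may differ in its details.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SinkFileEntry:
    """Hypothetical stand-in for one output-file entry in the sink metadata log."""
    path: str
    modification_time_ms: int


def compact_with_ttl(
    previous_compact: List[SinkFileEntry],
    new_entries: List[SinkFileEntry],
    now_ms: int,
    ttl_ms: int,
) -> List[SinkFileEntry]:
    """Merge entries for a compact batch, dropping those older than the TTL.

    Assumption: an entry expires when its modification time is earlier
    than now_ms - ttl_ms.
    """
    cutoff = now_ms - ttl_ms
    merged = list(previous_compact) + list(new_entries)
    return [entry for entry in merged if entry.modification_time_ms >= cutoff]


# Usage: only entries written within the last 2000 ms survive the compaction.
kept = compact_with_ttl(
    previous_compact=[SinkFileEntry("a", 1_000), SinkFileEntry("b", 9_000)],
    new_entries=[SinkFileEntry("c", 10_000)],
    now_ms=10_000,
    ttl_ms=2_000,
)
print([entry.path for entry in kept])
```

Because the filter runs on every compact batch, the compact file in this sketch is bounded by the number of entries written within one TTL window rather than by the lifetime of the query.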

> Purge Structured streaming FileStreamSinkLog metadata compact file data.
> ------------------------------------------------------------------------
>                 Key: SPARK-24295
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Iqbal Singh
>            Priority: Major
>         Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz
> FileStreamSinkLog metadata logs are concatenated into a single compact file after a defined
> compact interval.
> For long-running jobs, the compact file can grow to tens of GB, causing slowness when reading
> data from the FileStreamSinkLog dir, as Spark defaults to the "_spark_metadata" dir for the read.
> We need functionality to purge the compact file data.
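
The growth described in the issue can be illustrated with a small simulation in plain Python. This is only a sketch under stated assumptions: `compact_file_sizes` is a hypothetical helper, every batch is assumed to write the same number of files, and a batch is treated as a compact batch when `(batch_id + 1)` is a multiple of the compact interval.

```python
from typing import List, Tuple


def compact_file_sizes(
    num_batches: int, files_per_batch: int, compact_interval: int
) -> List[Tuple[int, int]]:
    """Return (batch_id, entry_count) for each compact batch.

    Assumption: with no purging, every compact file carries all entries
    written since the query started.
    """
    sizes = []
    total_entries = 0
    for batch_id in range(num_batches):
        total_entries += files_per_batch
        # Treat every compact_interval-th batch as a compact batch.
        if (batch_id + 1) % compact_interval == 0:
            sizes.append((batch_id, total_entries))
    return sizes


# Usage: 30 batches, 5 output files per batch, compaction every 10 batches.
for batch_id, entry_count in compact_file_sizes(30, 5, 10):
    print(f"compact batch {batch_id}: {entry_count} entries")
```

In this model the entry count of each compact file grows linearly with the number of batches, which is why a long-running job ends up with a very large compact file unless old entries are purged.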

This message was sent by Atlassian Jira
