spark-user mailing list archives

From Jungtaek Lim <>
Subject Re: Spark Structure Streaming | FileStreamSourceLog not deleting list of input files | Spark -2.4.0
Date Tue, 21 Apr 2020 21:45:07 GMT
You're hitting an existing issue. While there's no active
PR to address it, I've been planning to take a look sooner rather than later.

Btw, you may also want to take a look at my previous mail - the topic of
that thread was the file stream sink metadata growing bigger, but
that's basically the same issue, so you may find some useful information
there. (tl;dr: I have a bunch of PRs addressing multiple issues in the
file stream source and sink; they're just lacking some love.)

Jungtaek Lim (HeartSaVioR)

On Tue, Apr 21, 2020 at 8:23 PM Pappu Yadav <> wrote:

> Hi Team,
> While running Spark, below are some findings.
>    1. FileStreamSourceLog is responsible for maintaining input source
>    file list.
>    2. Spark Streaming deletes expired log files on the basis of
>    *spark.sql.streaming.fileSource.log.deletion* and
>    *spark.sql.streaming.minBatchesToRetain.*
>    3. But while compacting logs, Spark Streaming writes the complete list
>    of files the stream has seen so far into one single .compact file in HDFS.
>    4. Over time this compact file grows to around 2 GB-5 GB in HDFS,
>    which delays creation of the compact file after every 10th batch and
>    also increases job restart time.
>    5. Why does Spark Streaming log files that have already been deleted?
>    There should be some configurable timeout for creating the compact file
>    so that Spark can skip writing the expired list of input files.
> *Also kindly let me know if I missed something and there is some
> configuration already present to handle this.*
> Regards
> Pappu Yadav
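
For reference, the retention settings named in the thread are set on the SparkSession. This is a minimal sketch, not a recommendation: the config keys `spark.sql.streaming.fileSource.log.deletion`, `spark.sql.streaming.fileSource.log.compactInterval`, and `spark.sql.streaming.minBatchesToRetain` exist in Spark 2.4, but the values below are illustrative assumptions, and (as discussed above) none of them prevents the .compact file itself from accumulating entries for already-deleted input files.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: tuning file-source metadata log retention (values are illustrative).
val spark = SparkSession.builder()
  .appName("file-source-retention-sketch")
  // Whether to delete expired file-source log files (default: true).
  .config("spark.sql.streaming.fileSource.log.deletion", "true")
  // Compact the file-source log every N batches (default: 10); the
  // thread's "every 10th batch" delay corresponds to this interval.
  .config("spark.sql.streaming.fileSource.log.compactInterval", "10")
  // Minimum number of batches of metadata to retain (default: 100).
  .config("spark.sql.streaming.minBatchesToRetain", "100")
  .getOrCreate()
```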
