spark-dev mailing list archives

From Jungtaek Lim <>
Subject Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source
Date Mon, 28 Sep 2020 01:11:19 GMT
Bumping to see if anyone is interested in or concerned about this.

On Tue, Aug 25, 2020 at 4:56 PM Jungtaek Lim <> wrote:

> Bump this again.
> On Tue, Aug 18, 2020 at 12:11 PM Jungtaek Lim <
>> wrote:
>> Bump again.
>> Unlike the file stream sink, which has lots of limitations and for which
>> many of us have been suggesting alternatives, the file stream source is
>> the only way if end users want to read data from files. There is no
>> alternative unless they introduce another ETL & storage layer (probably
>> Kafka).
>> On Fri, Jul 31, 2020 at 3:06 PM Jungtaek Lim <
>>> wrote:
>>> Hi German,
>>> Option 1 isn't about "deleting" the old files, as your input directory
>>> may be accessed by multiple queries. Kafka centralizes the maintenance of
>>> input data, hence it's possible to apply retention without problems.
>>> Option 1 is more about "hiding" the old files from being read, so that
>>> end users "may" be able to delete the files once they ensure "all queries
>>> accessing the input directory" no longer see the old files.
>>> On Fri, Jul 31, 2020 at 2:57 PM German Schiavon <
>>>> wrote:
>>>> Hi Jungtaek,
>>>> I have a question: aren't both approaches compatible?
>>>> The way I see it, I think it would be interesting to have a retention
>>>> period to delete old files and/or the possibility of indicating an
>>>> offset (timestamp). It would be very "similar" to how we do it with
>>>> Kafka.
>>>> WDYT?
>>>> On Thu, 30 Jul 2020 at 23:51, Jungtaek Lim <
>>>>> wrote:
>>>>> (I'd like to keep this discussion thread focused on the specific topic
>>>>> - let's initiate separate discussion threads for different topics.)
>>>>> Thanks for the input. I'd like to emphasize that the point under
>>>>> discussion is the "latestFirst" option - the rationale starts from the
>>>>> growing metadata log issues. Your input sounds like it's picking option
>>>>> 2, but could you please make clear whether you are OK with "replacing"
>>>>> the "latestFirst" option with "starting from timestamp"?
>>>>> On Thu, Jul 30, 2020 at 4:48 PM vikram agrawal <
>>>>>> wrote:
>>>>>> If we compare the file stream source with other streaming sources
>>>>>> such as Kafka, the current behavior is indeed incomplete. Starting the
>>>>>> streaming query from a custom offset / particular point in time is
>>>>>> something that is missing.
>>>>>> Typically, file stream sources don't have auto-deletion of the older
>>>>>> data/files. In Kafka we can define the retention period, so even if we
>>>>>> use "earliest" we won't end up reading from the time when the Kafka
>>>>>> topic was created. On the other hand, file stream sources can hold
>>>>>> very old files. There are very valid use cases to read the bulk of the
>>>>>> old files using a batch job until a particular timestamp, and then use
>>>>>> streaming jobs for real-time updates.
>>>>>> So having support where we can specify a timestamp, and we would only
>>>>>> consider files created past that timestamp, can be useful.
>>>>>> Another concern which we need to consider is the listing cost. Is
>>>>>> there any way we can avoid listing the entire base directory and
>>>>>> filtering out the new files? If the data is organized as partitions by
>>>>>> date, will it help to list only those partitions where new files were
>>>>>> added?
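The partition-pruning idea above can be sketched in plain Python. This is an illustration only, not how Spark's FileStreamSource currently lists files; the `date=YYYY-MM-DD` layout is the usual Hive-style partitioning assumption, and `partitions_to_list` is a hypothetical helper.

```python
# Sketch: instead of walking the whole base directory, list only the
# date-partition subdirectories at or after a cutoff date. Any other
# entries (e.g. _SUCCESS markers) are skipped.
from datetime import date, timedelta

def partitions_to_list(partitions, cutoff):
    """Keep only date=YYYY-MM-DD partition names on/after `cutoff`."""
    kept = []
    for name in partitions:
        if not name.startswith("date="):
            continue  # not a date partition; ignore
        part_date = date.fromisoformat(name[len("date="):])
        if part_date >= cutoff:
            kept.append(name)
    return sorted(kept)

# With a 2-day window ending 2020-07-30, only the two newest partitions
# would need an actual file listing.
parts = ["date=2020-07-27", "date=2020-07-29", "date=2020-07-30", "_SUCCESS"]
print(partitions_to_list(parts, date(2020, 7, 30) - timedelta(days=1)))
```

The win is that listing cost becomes proportional to the number of recent partitions rather than the total history of the directory.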
>>>>>> On Thu, Jul 30, 2020 at 11:22 AM Jungtaek Lim <
>>>>>>> wrote:
>>>>>>> Bump - is there any interest in this topic?
>>>>>>> On Mon, Jul 20, 2020 at 6:21 AM Jungtaek Lim <
>>>>>>>> wrote:
>>>>>>>> (Just to add the rationale, you can refer to the original thread on
>>>>>>>> the dev@ list to see efforts on addressing problems in file stream
>>>>>>>> source / sink -
>>>>>>>> )
>>>>>>>> On Mon, Jul 20, 2020 at 6:18 AM Jungtaek Lim <
>>>>>>>>> wrote:
>>>>>>>>> Hi devs,
>>>>>>>>> As I have been going through the various issues on metadata
>>>>>>>>> growing, it's not only an issue of the sink, but also an issue of
>>>>>>>>> the source.
>>>>>>>>> Unlike the sink metadata log, whose entries should be available to
>>>>>>>>> the readers, the source metadata log is only for the streaming
>>>>>>>>> query starting from the checkpoint; hence, in theory, it should
>>>>>>>>> only memorize the minimal set of entries which prevents processing
>>>>>>>>> the same file multiple times.
>>>>>>>>> This is not applied to the file stream source, and I think it's
>>>>>>>>> because of the existence of the "latestFirst" option, which I
>>>>>>>>> haven't seen in any other source. The option works by reading files
>>>>>>>>> in "backward" order, which means Spark can read the oldest file and
>>>>>>>>> the latest file together in a micro-batch, which ends up having to
>>>>>>>>> memorize all files previously read. The option can be changed
>>>>>>>>> during a query restart, so even if the query is started with
>>>>>>>>> "latestFirst" being false, it's not safe to apply the logic of
>>>>>>>>> minimizing the entries to memorize, as the option can be changed to
>>>>>>>>> true and then we'd read files again.
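To make the growth concrete, here is a toy Python model of the two read orders (not Spark's actual code): with forward order by modification time the log can be truncated at the frontier timestamp, while with latestFirst older files keep arriving in later batches, so every entry must be kept to deduplicate.

```python
# Toy model of the metadata-log growth described above.
# Files are (name, modification_time) pairs.

def process_forward(files):
    """Forward order: once a timestamp is passed, no older unseen file
    will ever be offered again, so older log entries can be purged."""
    seen, log = set(), []
    for name, ts in sorted(files, key=lambda f: f[1]):
        if name not in seen:
            seen.add(name)
            # keep only entries at the frontier timestamp
            log = [(n, t) for n, t in log if t >= ts] + [(name, ts)]
    return log

def process_latest_first(files):
    """latestFirst: older files arrive after newer ones, so purging by
    timestamp would cause re-reads; every entry must be memorized."""
    seen = set()
    for name, ts in sorted(files, key=lambda f: f[1], reverse=True):
        seen.add(name)
    return sorted(seen)

files = [("a", 1), ("b", 2), ("c", 3)]
print(len(process_forward(files)))       # stays at the frontier
print(len(process_latest_first(files)))  # grows with total file count
```

(Real implementations keep a small window of same-timestamp entries rather than a single frontier entry; the point is only that a timestamp threshold is safe in one direction and unsafe in the other.)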
>>>>>>>>> I'm seeing two approaches here:
>>>>>>>>> 1) Apply "retention" - unlike "maxFileAge", the option would apply
>>>>>>>>> with latestFirst as well. That said, if the retention is set to 7
>>>>>>>>> days, files older than 7 days would never be read in any way. With
>>>>>>>>> this approach we can at least get rid of entries which are older
>>>>>>>>> than the retention. The issue is how to play nicely with the
>>>>>>>>> existing "maxFileAge", as it also behaves similarly to the
>>>>>>>>> retention, though it's being ignored when latestFirst is turned on.
>>>>>>>>> (Change the semantics of "maxFileAge", vs. leave it as a "soft
>>>>>>>>> retention" and introduce another option.)
>>>>>>>>> (This approach is being proposed under SPARK-17604, and a PR is
>>>>>>>>> available -
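A minimal sketch of the retention idea in plain Python. The helper name, its semantics, and the 7-day value are illustrative only; SPARK-17604 defines the actual proposal.

```python
# Sketch: files older than `retention_ms` are never offered to the query,
# so the matching metadata-log entries can be dropped outright.

RETENTION_MS = 7 * 24 * 60 * 60 * 1000  # illustrative 7-day retention

def apply_retention(entries, now_ms, retention_ms=RETENTION_MS):
    """Split (name, mod_time_ms) entries into readable vs. purgeable."""
    cutoff = now_ms - retention_ms
    readable = [(n, t) for n, t in entries if t >= cutoff]
    purged = [(n, t) for n, t in entries if t < cutoff]
    return readable, purged

now = 10 * 24 * 60 * 60 * 1000  # "day 10" in ms
entries = [("old.json", 0), ("new.json", now - 1000)]
readable, purged = apply_retention(entries, now)
print([n for n, _ in readable])
print([n for n, _ in purged])
```

Because the cutoff applies regardless of read order, this bound would hold even with latestFirst turned on, which is exactly what "maxFileAge" does not guarantee today.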
>>>>>>>>> 2) Replace the "latestFirst" option with alternatives which no
>>>>>>>>> longer read in "backward" order - this doesn't mean we have to read
>>>>>>>>> all files to move forward. As we do with Kafka, a start offset can
>>>>>>>>> be provided, ideally as a timestamp, from which Spark will read in
>>>>>>>>> forward order. This doesn't cover all use cases of "latestFirst",
>>>>>>>>> but as "latestFirst" doesn't seem to be natural with the concept of
>>>>>>>>> SS (think about watermarks), I'd prefer to support alternatives
>>>>>>>>> instead of struggling with "latestFirst".
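A sketch of option 2 in plain Python, similar in spirit to timestamp-based starting offsets in the Kafka source. The `files_from` helper and its start-timestamp parameter are hypothetical, not a committed Spark API.

```python
# Sketch: accept a start timestamp and read only files modified at or
# after it, always in forward (oldest-first) order.

def files_from(files, start_ts):
    """Return (name, mod_time) pairs with mod_time >= start_ts, oldest
    first, so the metadata log only needs a moving frontier."""
    return sorted(
        [(n, t) for n, t in files if t >= start_ts],
        key=lambda f: f[1],
    )

files = [("a", 100), ("b", 200), ("c", 300)]
print(files_from(files, 200))  # skips "a", reads forward from 200
```

This keeps the "start near the latest data" use case of latestFirst while remaining compatible with the frontier-based log compaction that forward order allows.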
>>>>>>>>> Would like to hear your opinions.
>>>>>>>>> Thanks,
>>>>>>>>> Jungtaek Lim (HeartSaVioR)
