spark-dev mailing list archives

From Jungtaek Lim <kabhwan.opensou...@gmail.com>
Subject Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source
Date Mon, 28 Sep 2020 01:11:19 GMT
Bump to see whether anyone is interested in or concerned about this.

On Tue, Aug 25, 2020 at 4:56 PM Jungtaek Lim <kabhwan.opensource@gmail.com>
wrote:

> Bump this again.
>
> On Tue, Aug 18, 2020 at 12:11 PM Jungtaek Lim <
> kabhwan.opensource@gmail.com> wrote:
>
>> Bump again.
>>
>> Unlike the file stream sink, which has lots of limitations and for which
>> many of us have been suggesting alternatives, the file stream source is the
>> only way for end users to read data from files. There is no alternative
>> unless they introduce another ETL & storage layer (probably Kafka).
>>
>> On Fri, Jul 31, 2020 at 3:06 PM Jungtaek Lim <
>> kabhwan.opensource@gmail.com> wrote:
>>
>>> Hi German,
>>>
>>> Option 1 isn't about "deleting" the old files, as your input directory
>>> may be accessed by multiple queries. Kafka centralizes the maintenance of
>>> input data, hence it can apply retention without problems.
>>> Option 1 is more about "hiding" the old files from being read, so that end
>>> users "may" be able to delete the files once they ensure "all queries
>>> accessing the input directory" no longer see the old files.
>>>
>>> On Fri, Jul 31, 2020 at 2:57 PM German Schiavon <
>>> gschiavonspark@gmail.com> wrote:
>>>
>>>> Hi Jungtaek,
>>>>
>>>> I have a question: aren't both approaches compatible?
>>>>
>>>> As I see it, it would be interesting to have a retention period to
>>>> delete old files and/or the possibility of specifying an offset
>>>> (timestamp). It would be very "similar" to how we do it with Kafka.
>>>>
>>>> WDYT?
>>>>
>>>> On Thu, 30 Jul 2020 at 23:51, Jungtaek Lim <
>>>> kabhwan.opensource@gmail.com> wrote:
>>>>
>>>>> (I'd like to keep this discussion thread focused on the specific topic
>>>>> - let's start separate discussion threads for different topics.)
>>>>>
>>>>> Thanks for the input. I'd like to emphasize that the point under
>>>>> discussion is the "latestFirst" option - the rationale comes from the
>>>>> growing metadata log issues. I hope your input is picking option 2, but
>>>>> could you please confirm that you are OK with "replacing" the
>>>>> "latestFirst" option with "starting from timestamp"?
>>>>>
>>>>>
>>>>> On Thu, Jul 30, 2020 at 4:48 PM vikram agrawal <
>>>>> vikram.agrawal@gmail.com> wrote:
>>>>>
>>>>>> If we compare the file stream source with other streaming sources such
>>>>>> as Kafka, the current behavior is indeed incomplete. Starting the
>>>>>> stream from a custom offset or a particular point in time is missing.
>>>>>> Typically, file stream sources don't auto-delete older data/files. In
>>>>>> Kafka we can define a retention period, so even if we use "earliest" we
>>>>>> won't end up reading from the time the Kafka topic was created. File
>>>>>> stream sources, on the other hand, can hold very old files. It's a very
>>>>>> valid use case to read the bulk of the old files with a batch job up to
>>>>>> a particular timestamp, and then use a streaming job for real-time
>>>>>> updates.
>>>>>>
>>>>>> So support for specifying a timestamp, where we would only consider
>>>>>> files created after that timestamp, would be useful.
>>>>>>
>>>>>> Another concern we need to consider is the listing cost. Is there any
>>>>>> way to avoid listing the entire base directory and then filtering out
>>>>>> the new files? If the data is organized as date partitions, would it
>>>>>> help to list only those partitions where new files were added?
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 30, 2020 at 11:22 AM Jungtaek Lim <
>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>
>>>>>>> Bump, is there any interest in this topic?
>>>>>>>
>>>>>>> On Mon, Jul 20, 2020 at 6:21 AM Jungtaek Lim <
>>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>>
>>>>>>>> (Just to add rationale: you can refer to the original mail thread on
>>>>>>>> the dev@ list to see the efforts on addressing problems in the file
>>>>>>>> stream source / sink -
>>>>>>>> https://lists.apache.org/thread.html/r1cd548be1cbae91c67e5254adc0404a99a23930f8a6fde810b987285%40%3Cdev.spark.apache.org%3E
>>>>>>>> )
>>>>>>>>
>>>>>>>> On Mon, Jul 20, 2020 at 6:18 AM Jungtaek Lim <
>>>>>>>> kabhwan.opensource@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi devs,
>>>>>>>>>
>>>>>>>>> As I have been going through the various issues on metadata log
>>>>>>>>> growth, I've found it's an issue not only for the sink but also for
>>>>>>>>> the source. Unlike the sink metadata log, whose entries must be
>>>>>>>>> available to readers, the source metadata log is only used by the
>>>>>>>>> streaming query restarting from the checkpoint. Hence, in theory, it
>>>>>>>>> only needs to remember the minimal set of entries that prevents the
>>>>>>>>> same file from being processed multiple times.
>>>>>>>>>
>>>>>>>>> The file stream source doesn't do this, and I think it's because of
>>>>>>>>> the "latestFirst" option, which I haven't seen in any other source.
>>>>>>>>> The option reads files in "backward" order, which means Spark can
>>>>>>>>> read the oldest file and the latest file together in one micro-batch,
>>>>>>>>> and so ends up having to remember all files previously read. The
>>>>>>>>> option can be changed on query restart, so even if the query is
>>>>>>>>> started with "latestFirst" set to false, it's not safe to minimize
>>>>>>>>> the entries to remember: the option can be changed to true, and then
>>>>>>>>> we'd read files again.
>>>>>>>>>
>>>>>>>>> I'm seeing two approaches here:
>>>>>>>>>
>>>>>>>>> 1) Apply "retention" - unlike "maxFileAge", the option would apply to
>>>>>>>>> latestFirst as well. That is, if the retention is set to 7 days,
>>>>>>>>> files older than 7 days would never be read in any way. With this
>>>>>>>>> approach we can at least get rid of entries older than the retention.
>>>>>>>>> The issue is how to play nicely with the existing "maxFileAge", as it
>>>>>>>>> plays a similar role to retention, though it's ignored when
>>>>>>>>> latestFirst is turned on. (Change the semantics of "maxFileAge" vs.
>>>>>>>>> leave it as "soft retention" and introduce another option.)
>>>>>>>>>
>>>>>>>>> (This approach is proposed under SPARK-17604, and a PR is
>>>>>>>>> available - https://github.com/apache/spark/pull/28422)
>>>>>>>>>
>>>>>>>>> 2) Replace the "latestFirst" option with alternatives that no longer
>>>>>>>>> read in "backward" order - this doesn't mean we have to read all
>>>>>>>>> files to move forward. As we do with Kafka, a start offset can be
>>>>>>>>> provided, ideally as a timestamp, and Spark would read from that
>>>>>>>>> timestamp in forward order. This doesn't cover all use cases of
>>>>>>>>> "latestFirst", but "latestFirst" doesn't seem natural with the
>>>>>>>>> concept of Structured Streaming (think about watermarks), so I'd
>>>>>>>>> prefer to support alternatives instead of struggling with
>>>>>>>>> "latestFirst".
>>>>>>>>> Would like to hear your opinions.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>>
>>>>>>>>
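
For readers following the thread, the two approaches from the original
proposal can be sketched roughly as below. This is a plain-Python model, not
Spark code: the names select_new_files, purge_seen, retention_ms and
start_ts_ms are hypothetical and do not correspond to any existing Spark
option or API, and "now" is passed in explicitly as an assumption about how
the reference time would be chosen.

```python
from typing import Dict, List, Optional, Tuple

FileEntry = Tuple[str, int]  # (path, modification time in ms); hypothetical model

def select_new_files(listing: List[FileEntry], seen: Dict[str, int], now_ms: int,
                     retention_ms: Optional[int] = None,
                     start_ts_ms: Optional[int] = None) -> List[FileEntry]:
    """Return unseen files in forward (oldest-first) order.

    retention_ms models option 1 (a hard retention that always applies);
    start_ts_ms models option 2 (start reading from a timestamp).
    """
    lower_bound = 0
    if retention_ms is not None:
        lower_bound = max(lower_bound, now_ms - retention_ms)
    if start_ts_ms is not None:
        lower_bound = max(lower_bound, start_ts_ms)
    fresh = [(path, ts) for path, ts in listing
             if ts >= lower_bound and path not in seen]
    return sorted(fresh, key=lambda entry: (entry[1], entry[0]))

def purge_seen(seen: Dict[str, int], now_ms: int, retention_ms: int) -> Dict[str, int]:
    """Drop remembered entries older than the retention.

    This is only safe because select_new_files never returns files older
    than the same cutoff, so a purged file cannot be picked up again.
    """
    cutoff = now_ms - retention_ms
    return {path: ts for path, ts in seen.items() if ts >= cutoff}

DAY_MS = 24 * 60 * 60 * 1000
listing = [("a.json", 1 * DAY_MS), ("b.json", 5 * DAY_MS), ("c.json", 9 * DAY_MS)]

# Option 1: with a 7-day retention at day 10, a.json is never read.
batch = select_new_files(listing, {}, now_ms=10 * DAY_MS, retention_ms=7 * DAY_MS)
seen = dict(batch)

# Three days later the entry for b.json falls out of retention and is purged,
# yet listing the same directory again does not re-read it.
seen = purge_seen(seen, now_ms=13 * DAY_MS, retention_ms=7 * DAY_MS)
replay = select_new_files(listing, seen, now_ms=13 * DAY_MS, retention_ms=7 * DAY_MS)

# Option 2: start from day 4 and read forward.
from_ts = select_new_files(listing, {}, now_ms=10 * DAY_MS, start_ts_ms=4 * DAY_MS)
```

The point of the sketch is that once reads are bounded from below (by a
retention or by a start timestamp) and always move forward, the set of
remembered entries can be bounded too, which is not possible while a
"backward" read order can reach arbitrarily old files.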
