spark-user mailing list archives

From Jacek Laskowski <ja...@japila.pl>
Subject Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)
Date Fri, 19 Jun 2020 09:02:32 GMT
Hi,

While we're at it, my basic understanding of the metadata directory is that
only the two most recent compact files and the non-compact files in between
are actually necessary. Is my understanding correct?
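
For reference, these are the file sink metadata log settings that, as far as I
understand, govern compaction and cleanup (a sketch; the values shown are what
I believe the defaults to be, so please correct me if the names or values are
off):

// Compaction/cleanup knobs for the file stream sink metadata log
// (set before starting the query; values shown are believed defaults).
spark.conf.set("spark.sql.streaming.fileSink.log.deletion", "true")       // delete obsolete log files
spark.conf.set("spark.sql.streaming.fileSink.log.compactInterval", "10")  // compact every 10 batches
spark.conf.set("spark.sql.streaming.fileSink.log.cleanupDelay", "10m")    // keep expired files a bit longer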

Pozdrawiam,
Jacek Laskowski
----
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski



On Fri, Jun 19, 2020 at 2:16 AM Jungtaek Lim <kabhwan.opensource@gmail.com>
wrote:

> Shall we document the known issue with the file stream sink and provide a
> workaround? There have been more than a couple of questions about this over
> the past couple of months, and there are five related issues. The workaround
> Burak provided looks good for those who don't need end-to-end exactly-once
> semantics (and in many cases they are OK with those semantics).
>
> On Fri, Jun 19, 2020 at 8:05 AM Burak Yavuz <brkyvz@gmail.com> wrote:
>
>> Hi Rachana,
>>
>> If you don't need exactly once semantics, you can use foreachBatch to
>> write your data.
>> df.writeStream.foreachBatch { case (df, batchId) =>
>>   df.write.mode("append").format(...).save(path)
>> }
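>>
>> For completeness, here is a minimal, self-contained sketch of that
>> workaround (the Kafka options, output path, and checkpoint location below
>> are placeholders, not values from this thread):
>>
>> // Requires the spark-sql-kafka-0-10 package for the Kafka source.
>> import org.apache.spark.sql.{DataFrame, SparkSession}
>>
>> val spark = SparkSession.builder().appName("kafka-to-s3").getOrCreate()
>>
>> val source = spark.readStream
>>   .format("kafka")
>>   .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
>>   .option("subscribe", "events")                     // placeholder
>>   .load()
>>
>> // A typed function value keeps foreachBatch overload resolution
>> // unambiguous and makes the per-batch logic easy to test on its own.
>> val writeBatch: (DataFrame, Long) => Unit = (batchDf, batchId) => {
>>   // Plain batch write per micro-batch: no _spark_metadata folder is
>>   // created, but semantics are at-least-once (a failed batch may be
>>   // retried and re-appended).
>>   batchDf.write.mode("append").parquet("s3a://bucket/output")  // placeholder
>> }
>>
>> val query = source.writeStream
>>   .foreachBatch(writeBatch)
>>   .option("checkpointLocation", "s3a://bucket/checkpoint")     // placeholder
>>   .start()
>>
>> query.awaitTermination()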
>>
>> However, I would highly recommend upgrading to some ACID data store
>> project like Delta Lake (which natively supports streaming), Iceberg or
>> Hudi.
>>
>> Best,
>> Burak
>>
>> On Thu, Jun 18, 2020 at 8:24 AM Rachana Srivastava
>> <rachanasrivastav@yahoo.com.invalid> wrote:
>>
>>> Thanks so much for your response. I agree that using Spark Streaming
>>> (DStreams) is not recommended. But I want a stable system; we cannot have
>>> a system that crashes every 5 days. As seen in the picture below, we have
>>> nearly 47 MB of data in the metadata folder. The issue is that when the
>>> size of the data grows to nearly 13 GB and driver memory is 5 GB, we get
>>> an OOM. I am not sure how to add a TTL to the metadata; if I delete the
>>> metadata, then I also have to delete the checkpoint and hence lose the
>>> data.
>>>
>>> [image: Inline image]
>>>
>>>
>>> On Thursday, June 18, 2020, 03:23:32 AM PDT, Jacek Laskowski <
>>> jacek@japila.pl> wrote:
>>>
>>>
>>> Hi Rachana,
>>>
>>> > Should I go backward and use the DStream-based Spark Streaming?
>>>
>>> No. Never. It's no longer supported (and should really be removed from
>>> the codebase once and for all - dreaming...).
>>>
>>> Spark focuses on Spark SQL and Spark Structured Streaming as user-facing
>>> modules for batch and streaming queries, respectively.
>>>
>>> Please note that I'm not a PMC member or even a committer so I'm
>>> speaking for myself only (not representing the project in an official way).
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> ----
>>> https://about.me/JacekLaskowski
>>> "The Internals Of" Online Books <https://books.japila.pl/>
>>> Follow me on https://twitter.com/jaceklaskowski
>>>
>>>
>>>
>>> On Thu, Jun 18, 2020 at 12:03 AM Rachana Srivastava
>>> <rachanasrivastav@yahoo.com.invalid> wrote:
>>>
>>> *Structured Streaming vs. Spark Streaming (DStream)?*
>>>
>>> Which is recommended for system stability? Exactly-once is NOT the first
>>> priority. The first priority is a STABLE system.
>>>
>>> I need to make a decision soon. I need help. Here is the question
>>> again: should I go backward and use the DStream-based Spark Streaming,
>>> write our own checkpointing, and go from there? At least we never
>>> encountered these metadata issues there.
>>>
>>> Thanks,
>>>
>>> Rachana
>>>
>>> On Wednesday, June 17, 2020, 02:02:20 PM PDT, Jungtaek Lim <
>>> kabhwan.opensource@gmail.com> wrote:
>>>
>>>
>>> Just in case anyone prefers ASF projects, there are alternative projects
>>> in the ASF as well; alphabetically, Apache Hudi [1] and Apache Iceberg [2].
>>> Both recently graduated as top-level projects. (DISCLAIMER: I'm not
>>> involved in either.)
>>>
>>> BTW, it would be nice if we made the metadata implementation of the file
>>> stream source/sink pluggable. From what I've seen, a plugin approach has
>>> been chosen whenever some part becomes complicated and it is arguable
>>> whether it should be handled inside the Spark project or outside of it,
>>> e.g. the checkpoint file manager, the state store provider, etc. (see the
>>> configuration sketch after the links below). It would give the ecosystem a
>>> chance to tackle the challenge "without completely rewriting the file
>>> stream source and sink", focusing on metadata scalability for long-running
>>> queries. The alternative projects described above still provide richer,
>>> higher-level features and look attractive, but sometimes that may just be
>>> "using a sledgehammer to crack a nut".
>>>
>>> 1. https://hudi.apache.org/
>>> 2. https://iceberg.apache.org/
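>>>
>>> As a concrete illustration of those existing plugin points, this is
>>> roughly how they are selected today via configuration (a sketch; the
>>> classes shown are hypothetical, not real implementations):
>>>
>>> // Custom checkpoint file manager (plugin point for checkpoint I/O).
>>> spark.conf.set("spark.sql.streaming.checkpointFileManagerClass",
>>>   "com.example.MyCheckpointFileManager")  // hypothetical class
>>>
>>> // Custom state store provider (plugin point for streaming state).
>>> spark.conf.set("spark.sql.streaming.stateStore.providerClass",
>>>   "com.example.MyStateStoreProvider")     // hypothetical class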
>>>
>>>
>>> On Thu, Jun 18, 2020 at 2:34 AM Tathagata Das <
>>> tathagata.das1565@gmail.com> wrote:
>>>
>>> Hello Rachana,
>>>
>>> Getting exactly-once semantics on files and making it scale to a very
>>> large number of files are very hard problems to solve. While Structured
>>> Streaming + the built-in file sink provides the exactly-once guarantee
>>> that DStreams could not, it is definitely limited in other ways (scaling
>>> in terms of the number of files, combining batch and streaming writes in
>>> the same place, etc.). And solving this problem requires a holistic
>>> solution that is arguably beyond the scope of the Spark project.
>>>
>>> There are other projects that are trying to solve this file management
>>> issue. For example, Delta Lake <https://delta.io/> (full disclosure, I
>>> am involved in it) was built to solve exactly this problem: get
>>> exactly-once and ACID guarantees on files, but also scale to handling
>>> millions of files. Please consider it as part of your solution.
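>>>
>>> For illustration, with the Delta Lake library on the classpath, the
>>> streaming write can target a Delta table instead of plain files (a
>>> sketch; kafkaStream stands for any streaming DataFrame, such as one read
>>> from Kafka, and the paths are placeholders):
>>>
>>> kafkaStream.writeStream
>>>   .format("delta")                                          // Delta sink instead of the built-in file sink
>>>   .option("checkpointLocation", "s3a://bucket/checkpoint")  // placeholder
>>>   .start("s3a://bucket/delta-table")                        // placeholder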
>>>
>>>
>>>
>>>
>>> On Wed, Jun 17, 2020 at 9:50 AM Rachana Srivastava
>>> <rachanasrivastav@yahoo.com.invalid> wrote:
>>>
>>> I have written a simple Spark Structured Streaming app to move data from
>>> Kafka to S3. I found that in order to support the exactly-once guarantee,
>>> Spark creates a _spark_metadata folder, which ends up growing too large
>>> because the streaming app is SUPPOSED TO run FOREVER. When the streaming
>>> app runs for a long time, the metadata folder grows so big that we start
>>> getting OOM errors. The only way to resolve the OOM is to delete the
>>> checkpoint and metadata folders and lose VALUABLE customer data.
>>>
>>> Related open Spark JIRAs: SPARK-24295, SPARK-29995, and SPARK-30462.
>>>
>>> Since Spark Streaming (DStreams) was NOT broken like this, is Spark
>>> Streaming a BETTER choice?
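>>>
>>> For context, an app like the one described above presumably uses the
>>> built-in file sink, roughly as in this sketch (broker, topic, and paths
>>> are placeholders); it is this sink that creates the _spark_metadata
>>> folder under the output path:
>>>
>>> spark.readStream
>>>   .format("kafka")
>>>   .option("kafka.bootstrap.servers", "broker:9092")         // placeholder
>>>   .option("subscribe", "events")                            // placeholder
>>>   .load()
>>>   .writeStream
>>>   .format("parquet")                                        // built-in file sink
>>>   .option("path", "s3a://bucket/output")                    // placeholder; _spark_metadata lives here
>>>   .option("checkpointLocation", "s3a://bucket/checkpoint")  // placeholder
>>>   .start()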
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>>
