spark-user mailing list archives

From Jungtaek Lim <kabhwan.opensou...@gmail.com>
Subject Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)
Date Fri, 19 Jun 2020 00:16:06 GMT
Shall we document the known issue with the file stream sink and provide a
workaround? There have been more than a couple of questions about this over
the past couple of months, and five related JIRA issues. The workaround Burak
provided looks good for those who don't need end-to-end exactly-once
semantics (and in many cases they are fine with the weaker semantics).
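
For the docs, a minimal sketch of that workaround could look like the
following (a sketch only: the Kafka source options, the Parquet format, and
the s3a paths are assumptions for illustration, not part of Burak's snippet
below):

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("kafka-to-s3").getOrCreate()

// Streaming source (options are illustrative).
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// foreachBatch bypasses the file sink, so no _spark_metadata folder is
// written under the output path. The trade-off: appends are at-least-once,
// because a retried micro-batch may write the same data twice.
val query = df.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.write.mode("append").parquet("s3a://bucket/output/")
  }
  .option("checkpointLocation", "s3a://bucket/checkpoints/")
  .start()

query.awaitTermination()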

On Fri, Jun 19, 2020 at 8:05 AM Burak Yavuz <brkyvz@gmail.com> wrote:

> Hi Rachana,
>
> If you don't need exactly once semantics, you can use foreachBatch to
> write your data.
> df.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
>   batchDF.write.mode("append").format(...).save(path)
> }.start()
>
> However, I would highly recommend upgrading to some ACID data store
> project like Delta Lake (which natively supports streaming), Iceberg or
> Hudi.
>
> Best,
> Burak
>
> On Thu, Jun 18, 2020 at 8:24 AM Rachana Srivastava
> <rachanasrivastav@yahoo.com.invalid> wrote:
>
>> Thanks so much for your response.  I agree using Spark Streaming is not
>> recommended.  But I want a stable system; we cannot have a system that
>> crashes every 5 days.  As seen in the picture below, we have nearly 47 MB
>> of data in the metadata folder.  The issue is that when the size of the
>> data increases to nearly 13 GB, with driver memory at 5 GB, we get OOM.
>> Not sure how to add a TTL to the metadata; if I delete the metadata then I
>> have to delete the checkpoint too, and hence lose the data.
>>
>> [image: inline screenshot showing the metadata folder size]
>>
>>
>> On Thursday, June 18, 2020, 03:23:32 AM PDT, Jacek Laskowski <
>> jacek@japila.pl> wrote:
>>
>>
>> Hi Rachana,
>>
>> > Should I go backward and use Spark Streaming DStream based.
>>
>> No. Never. It's no longer supported (and should really be removed from
>> the codebase once and for all - dreaming...).
>>
>> Spark focuses on Spark SQL and Spark Structured Streaming as user-facing
>> modules for batch and streaming queries, respectively.
>>
>> Please note that I'm not a PMC member or even a committer so I'm speaking
>> for myself only (not representing the project in an official way).
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> ----
>> https://about.me/JacekLaskowski
>> "The Internals Of" Online Books <https://books.japila.pl/>
>> Follow me on https://twitter.com/jaceklaskowski
>>
>>
>>
>> On Thu, Jun 18, 2020 at 12:03 AM Rachana Srivastava
>> <rachanasrivastav@yahoo.com.invalid> wrote:
>>
>> *Structured Streaming vs. Spark Streaming (DStream)?*
>>
>> Which is recommended for system stability?  Exactly-once is NOT the first
>> priority.  The first priority is a STABLE system.
>>
>> I need to make a decision soon.  I need help.  Here is the question
>> again: should I go backward and use DStream-based Spark Streaming, write
>> our own checkpointing, and go from there?  At least we never encountered
>> these metadata issues there.
>>
>> Thanks,
>>
>> Rachana
>>
>> On Wednesday, June 17, 2020, 02:02:20 PM PDT, Jungtaek Lim <
>> kabhwan.opensource@gmail.com> wrote:
>>
>>
>> Just in case anyone prefers ASF projects, there are alternative projects
>> in the ASF as well: alphabetically, Apache Hudi [1] and Apache Iceberg
>> [2]. Both recently graduated as top-level projects. (DISCLAIMER: I'm not
>> involved in either.)
>>
>> BTW it would be nice if we made the metadata implementation of the file
>> stream source/sink pluggable - from what I've seen, the plugin approach
>> has been selected as the way to go whenever some part is going to be
>> complicated and it becomes arguable whether it should be handled inside
>> the Spark project or outside of it, e.g. the checkpoint manager, the
>> state store provider, etc. That would open up chances for the ecosystem
>> to tackle the challenge "without completely re-writing the file stream
>> source and sink", focusing on metadata scalability for long-running
>> queries. The alternative projects described above provide richer,
>> higher-level features and look attractive, but sometimes that may be just
>> "using a sledgehammer to crack a nut".
>>
>> 1. https://hudi.apache.org/
>> 2. https://iceberg.apache.org/
>>
>>
>> On Thu, Jun 18, 2020 at 2:34 AM Tathagata Das <
>> tathagata.das1565@gmail.com> wrote:
>>
>> Hello Rachana,
>>
>> Getting exactly-once semantics on files and making it scale to a very
>> large number of files are very hard problems to solve. While Structured
>> Streaming + the built-in file sink provides the exactly-once guarantee
>> that DStreams could not, it is definitely limited in other ways (scaling
>> in terms of files, combining batch and streaming writes in the same
>> place, etc.). And solving this problem requires a holistic solution that
>> is arguably beyond the scope of the Spark project.
>>
>> There are other projects that are trying to solve this file management
>> issue. For example, Delta Lake <https://delta.io/> (full disclosure: I am
>> involved in it) was built to solve exactly this problem - getting
>> exactly-once and ACID guarantees on files while also scaling to handle
>> millions of files. Please consider it as part of your solution.
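>>
>> As a hedged sketch of what that looks like (the paths here are
>> assumptions; see the Delta Lake docs for the authoritative API), the
>> streaming write becomes:
>>
>> // Write the stream as a Delta table instead of plain files. Delta's
>> // transaction log replaces _spark_metadata and is checkpointed and
>> // cleaned up over time, so it is designed not to grow unbounded.
>> val query = df.writeStream
>>   .format("delta")
>>   .option("checkpointLocation", "s3a://bucket/checkpoints/")
>>   .start("s3a://bucket/delta-table/")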
>>
>>
>> On Wed, Jun 17, 2020 at 9:50 AM Rachana Srivastava
>> <rachanasrivastav@yahoo.com.invalid> wrote:
>>
>> I have written a simple Spark Structured Streaming app to move data from
>> Kafka to S3. I found that in order to support the exactly-once guarantee,
>> Spark creates a _spark_metadata folder, which ends up growing too large
>> as the streaming app is SUPPOSED TO run FOREVER. When the streaming app
>> runs for a long time, the metadata folder grows so big that we start
>> getting OOM errors. The only way to resolve the OOM is to delete the
>> checkpoint and metadata folders and hence lose VALUABLE customer data.
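>>
>> For reference, the query shape is roughly the following (a sketch; the
>> topic, options, and paths are stand-ins, not our real values):
>>
>> // File sink: each micro-batch appends an entry to the _spark_metadata
>> // log under the output path, and the periodic compact files in that
>> // log keep growing for the lifetime of the query.
>> val query = spark.readStream
>>   .format("kafka")
>>   .option("kafka.bootstrap.servers", "broker:9092")
>>   .option("subscribe", "events")
>>   .load()
>>   .writeStream
>>   .format("parquet")
>>   .option("checkpointLocation", "s3a://bucket/checkpoints/")
>>   .option("path", "s3a://bucket/output/")
>>   .start()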
>>
>> Open Spark JIRAs: SPARK-24295, SPARK-29995, and SPARK-30462.
>> Since Spark Streaming was NOT broken like this, is Spark Streaming a
>> BETTER choice?
>>
>>
