spark-user mailing list archives

From Burak Yavuz <brk...@gmail.com>
Subject Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)
Date Thu, 18 Jun 2020 23:03:19 GMT
Hi Rachana,

If you don't need exactly-once semantics, you can use foreachBatch to write
your data:

df.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.write.mode("append").format(...).save(path)
}.start()
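
A more complete sketch of the same pattern, assuming a Kafka source and
Parquet output on S3 - the broker, topic, and paths below are illustrative
placeholders:

// Requires the org.apache.spark:spark-sql-kafka-0-10 package for the
// Kafka source.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("KafkaToS3").getOrCreate()

val kafkaDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("subscribe", "events")                     // placeholder topic
  .load()

val query = kafkaDF.writeStream
  // foreachBatch bypasses the built-in file sink, so no _spark_metadata
  // folder is written; the trade-off is at-least-once delivery (a retried
  // batch may be appended twice) instead of exactly-once.
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.write.mode("append").parquet("s3://bucket/output/")  // placeholder
  }
  .option("checkpointLocation", "s3://bucket/checkpoint/")  // Kafka offsets only
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()

query.awaitTermination()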

However, I would highly recommend upgrading to an ACID data store project
like Delta Lake (which natively supports streaming), Iceberg, or Hudi.
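
For example, switching the sketch above to a Delta sink is essentially a
one-line change; a minimal sketch, assuming the io.delta:delta-core package
is on the classpath and reusing kafkaDF and the placeholder paths from above:

val deltaQuery = kafkaDF.writeStream
  .format("delta")
  .outputMode("append")
  // Delta's transaction log is periodically checkpointed, so it does not
  // grow without bound the way _spark_metadata does.
  .option("checkpointLocation", "s3://bucket/delta-checkpoint/")  // placeholder
  .start("s3://bucket/delta-table/")                              // placeholder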

Best,
Burak

On Thu, Jun 18, 2020 at 8:24 AM Rachana Srivastava
<rachanasrivastav@yahoo.com.invalid> wrote:

> Thanks so much for your response.  I agree using Spark Streaming is not
> recommended, but I want a stable system; we cannot have a system that
> crashes every 5 days.  As seen in the picture below, we have nearly 47 MB
> of data in the metadata folder.  The issue is that when the size of the
> data increases to nearly 13 GB while driver memory is 5 GB, we get an OOM.
> Not sure how to add a TTL to the metadata; if I delete the metadata then I
> have to delete the checkpoint, and hence lose the data.
>
> [image: Inline image]
>
>
> On Thursday, June 18, 2020, 03:23:32 AM PDT, Jacek Laskowski <
> jacek@japila.pl> wrote:
>
>
> Hi Rachana,
>
> > Should I go back to DStream-based Spark Streaming?
>
> No. Never. It's no longer supported (and should really be removed from the
> codebase once and for all - dreaming...).
>
> Spark focuses on Spark SQL and Spark Structured Streaming as user-facing
> modules for batch and streaming queries, respectively.
>
> Please note that I'm not a PMC member or even a committer so I'm speaking
> for myself only (not representing the project in an official way).
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books <https://books.japila.pl/>
> Follow me on https://twitter.com/jaceklaskowski
>
>
>
> On Thu, Jun 18, 2020 at 12:03 AM Rachana Srivastava
> <rachanasrivastav@yahoo.com.invalid> wrote:
>
> *Structured Streaming vs Spark Streaming (DStreams)?*
>
> Which is recommended for system stability?  Exactly-once is NOT the first
> priority.  The first priority is a STABLE system.
>
> I need to make a decision soon.  I need help.  Here is the question
> again: should I go back to DStream-based Spark Streaming, write our own
> checkpointing, and go from there?  At least we never encountered these
> metadata issues there.
>
> Thanks,
>
> Rachana
>
> On Wednesday, June 17, 2020, 02:02:20 PM PDT, Jungtaek Lim <
> kabhwan.opensource@gmail.com> wrote:
>
>
> Just in case anyone prefers ASF projects, there are alternative projects in
> the ASF as well: alphabetically, Apache Hudi [1] and Apache Iceberg [2].
> Both recently graduated as top-level projects. (DISCLAIMER: I'm not
> involved in either.)
>
> BTW it would be nice if we made the metadata implementation of the file
> stream source/sink pluggable - from what I've seen, the plugin approach has
> been chosen whenever some part is going to be complicated and it becomes
> arguable whether that part should be handled in the Spark project or
> outside, e.g. the checkpoint manager, the state store provider, etc. It
> would open up chances for the ecosystem to tackle the challenge "without
> completely re-writing the file stream source and sink", focusing on
> metadata scalability in long-running queries. The alternative projects
> described above still provide richer, higher-level features and look
> attractive, but sometimes that may be "using a sledgehammer to crack a
> nut".
>
> 1. https://hudi.apache.org/
> 2. https://iceberg.apache.org/
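>
> For concreteness, the existing plugin points mentioned above are already
> wired up via config, and a metadata plugin could follow the same pattern.
> A minimal sketch - the two config keys are real Spark configs, while the
> com.example class names are hypothetical placeholders for a custom
> implementation:
>
> import org.apache.spark.sql.SparkSession
>
> // Both the checkpoint file manager and the state store provider are
> // swappable via config; pluggable file stream source/sink metadata
> // could be exposed the same way.
> val spark = SparkSession.builder()
>   .config("spark.sql.streaming.checkpointFileManagerClass",
>           "com.example.MyCheckpointFileManager")  // hypothetical plugin
>   .config("spark.sql.streaming.stateStore.providerClass",
>           "com.example.MyStateStoreProvider")     // hypothetical plugin
>   .getOrCreate()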
>
>
> On Thu, Jun 18, 2020 at 2:34 AM Tathagata Das <tathagata.das1565@gmail.com>
> wrote:
>
> Hello Rachana,
>
> Getting exactly-once semantics on files and making it scale to a very
> large number of files are very hard problems to solve. While Structured
> Streaming + built-in file sink solves the exactly-once guarantee that
> DStreams could not, it is definitely limited in other ways (scaling in
> terms of files, combining batch and streaming writes in the same place,
> etc). And solving this problem requires a holistic solution that is
> arguably beyond the scope of the Spark project.
>
> There are other projects that are trying to solve this file management
> issue. For example, Delta Lake <https://delta.io/> (full disclosure, I am
> involved in it) was built to solve exactly this problem - get exactly-once
> and ACID guarantees on files, but also scale to handling millions of files.
> Please consider it as part of your solution.
>
>
>
>
> On Wed, Jun 17, 2020 at 9:50 AM Rachana Srivastava
> <rachanasrivastav@yahoo.com.invalid> wrote:
>
> I have written a simple Spark Structured Streaming app to move data from
> Kafka to S3. I found that in order to support the exactly-once guarantee,
> Spark creates a _spark_metadata folder, which ends up growing too large, as
> the streaming app is SUPPOSED TO run FOREVER. When the streaming app runs
> for a long time, the metadata folder grows so big that we start getting OOM
> errors. The only way to resolve the OOM is to delete the checkpoint and
> metadata folders, and thereby lose VALUABLE customer data.
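>
> The shape of the job in question, as a sketch - broker, topic, and paths
> are illustrative placeholders:
>
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder().appName("KafkaToS3").getOrCreate()
>
> spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
>   .option("subscribe", "events")                     // placeholder topic
>   .load()
>   .writeStream
>   .format("parquet")
>   // The built-in file sink records every committed file under
>   // <path>/_spark_metadata to provide exactly-once; this is the folder
>   // that grows without bound in a forever-running query.
>   .option("path", "s3://bucket/output/")             // placeholder
>   .option("checkpointLocation", "s3://bucket/checkpoint/")
>   .start()
>   .awaitTermination()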
>
> Open Spark JIRAs: SPARK-24295, SPARK-29995, and SPARK-30462.
> Since Spark Streaming was NOT broken like this, is Spark Streaming a
> BETTER choice?
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
