spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jacek Laskowski <>
Subject Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)
Date Fri, 19 Jun 2020 09:02:32 GMT

While we're at it, my basic understanding of the metadata directory is that
simply two recent compacts and the non-compact files in-between are really
necessary. Is my understanding correct?

Jacek Laskowski
"The Internals Of" Online Books <>
Follow me on


On Fri, Jun 19, 2020 at 2:16 AM Jungtaek Lim <>

> Shall we document the known issue on file stream sink and provide
> workaround? There's more than a couple of questions about this in a couple
> of months, and there have been 5 related issues. The workaround Burak
> provided looks nice to those who don't need to have end-to-end exactly once
> semantics (and in many cases they are OK with the semantics).
> On Fri, Jun 19, 2020 at 8:05 AM Burak Yavuz <> wrote:
>> Hi Rachana,
>> If you don't need exactly once semantics, you can use foreachBatch to
>> write your data.
>> df.writeStream.foreachBatch { case (df, batchId) =>
>>   df.write.mode("append").format(...).save(path)
>> }
>> However, I would highly recommend upgrading to some ACID data store
>> project like Delta Lake (which natively supports streaming), Iceberg or
>> Hudi.
>> Best,
>> Burak
>> On Thu, Jun 18, 2020 at 8:24 AM Rachana Srivastava
>> <> wrote:
>>> Thanks so much for your response.  I agree using Spark Streaming is not
>>> recommended.  But I want a stable system we cannot have a system that
>>> crashes every 5 days.  As seen in the picture below we have nearly 47 mb of
>>> data in the metadata folder.  Issue is when size of data increases to
>>> nearly 13 GB and driver memory is 5 GB that time we get OOM.  Not sure how
>>> to add TTL to metadata, if I delete metadata then I have to delete
>>> checkpoint hence loose the data.
>>> [image: Inline image]
>>> On Thursday, June 18, 2020, 03:23:32 AM PDT, Jacek Laskowski <
>>>> wrote:
>>> Hi Rachana,
>>> > Should I go backward and use Spark Streaming DStream based.
>>> No. Never. It's no longer supported (and should really be removed from
>>> the codebase once and for all - dreaming...).
>>> Spark focuses on Spark SQL and Spark Structured Streaming as user-facing
>>> modules for batch and streaming queries, respectively.
>>> Please note that I'm not a PMC member or even a committer so I'm
>>> speaking for myself only (not representing the project in an official way).
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> ----
>>> "The Internals Of" Online Books <>
>>> Follow me on
>>> <>
>>> On Thu, Jun 18, 2020 at 12:03 AM Rachana Srivastava
>>> <> wrote:
>>> *Structured Stream Vs Spark Steaming (DStream)?*
>>> Which is recommended for system stability.  Exactly once is NOT first
>>> priority.  First priority is STABLE system.
>>> I am I need to make a decision soon.  I need help.  Here is the question
>>> again.  Should I go backward and use Spark Streaming DStream based.  Write
>>> our own checkpoint and go from there.  At least we never encounter these
>>> metadata issues there.
>>> Thanks,
>>> Rachana
>>> On Wednesday, June 17, 2020, 02:02:20 PM PDT, Jungtaek Lim <
>>>> wrote:
>>> Just in case if anyone prefers ASF projects then there are other
>>> alternative projects in ASF as well, alphabetically, Apache Hudi [1] and
>>> Apache Iceberg [2]. Both are recently graduated as top level projects.
>>> (DISCLAIMER: I'm not involved in both.)
>>> BTW it would be nice if we make the metadata implementation on file
>>> stream source/sink be pluggable - from what I've seen, plugin approach has
>>> been selected as the way to go whenever some part is going to be
>>> complicated and it becomes arguable whether the part should be handled in
>>> Spark project vs should be outside. e.g. checkpoint manager, state
>>> store provider, etc. It would open up chances for the ecosystem to play
>>> with the challenge "without completely re-writing the file stream source
>>> and sink", focusing on scalability for metadata in a long run query.
>>> Alternative projects described above will still provide more higher-level
>>> features and look attractive, but sometimes it may be just "using a
>>> sledgehammer to crack a nut".
>>> 1.
>>> 2.
>>> On Thu, Jun 18, 2020 at 2:34 AM Tathagata Das <
>>>> wrote:
>>> Hello Rachana,
>>> Getting exactly-once semantics on files and making it scale to a very
>>> large number of files are very hard problems to solve. While Structured
>>> Streaming + built-in file sink solves the exactly-once guarantee that
>>> DStreams could not, it is definitely limited in other ways (scaling in
>>> terms of files, combining batch and streaming writes in the same place,
>>> etc). And solving this problem requires a holistic solution that is
>>> arguably beyond the scope of the Spark project.
>>> There are other projects that are trying to solve this file management
>>> issue. For example, Delta Lake <>(full disclosure, I
>>> am involved in it) was built to exactly solve this problem - get
>>> exactly-once and ACID guarantees on files, but also scale to handling
>>> millions of files. Please consider it as part of your solution.
>>> On Wed, Jun 17, 2020 at 9:50 AM Rachana Srivastava
>>> <> wrote:
>>> I have written a simple spark structured steaming app to move data from
>>> Kafka to S3. Found that in order to support exactly-once guarantee spark
>>> creates _spark_metadata folder, which ends up growing too large as the
>>> streaming app is SUPPOSE TO run FOREVER. But when the streaming app runs
>>> for a long time the metadata folder grows so big that we start getting OOM
>>> errors. Only way to resolve OOM is delete Checkpoint and Metadata folder
>>> and loose VALUABLE customer data.
>>> Spark open JIRAs SPARK-24295 and SPARK-29995, SPARK-30462, and
>>> SPARK-24295)
>>> Since Spark Streaming was NOT broken like this. Is Spark Streaming a
>>> BETTER choice?
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail:

View raw message