spark-user mailing list archives

From Rachana Srivastava <rachanasrivas...@yahoo.com.INVALID>
Subject Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)
Date Fri, 19 Jun 2020 10:55:26 GMT
 Hello Burak,
Thanks for your response. We need exactly-once, but not at the cost of system stability.
Is "DataLake" your next-gen data pipeline solution? It looks like an ACID-compliant data lake...
I am asking because I want to create an ML pipeline using Spark, and I am not sure whether a
data lake is meant for that.
Thanks
Rachana
    On Thursday, June 18, 2020, 04:03:55 PM PDT, Burak Yavuz <brkyvz@gmail.com> wrote:
 
 
 Hi Rachana,
If you don't need exactly-once semantics, you can use foreachBatch to write your data:

    df.writeStream.foreachBatch { case (df, batchId) =>
      df.write.mode("append").format(...).save(path)
    }

However, I would highly recommend upgrading to some ACID data store project like Delta Lake
(which natively supports streaming), Iceberg or Hudi. 
Best,
Burak
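
For readers following along, here is a fuller sketch of the foreachBatch approach suggested above, applied to the Kafka-to-S3 case from the original question. The bucket, topic, broker address, output format, and trigger interval are all hypothetical placeholders, and the Kafka source assumes the spark-sql-kafka connector is on the classpath. Because each micro-batch is written with the plain batch writer, no _spark_metadata log is kept under the output path; the trade-off is that a batch retried after a failure may be written twice (at-least-once).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

object KafkaToS3ForeachBatch {
  // Hypothetical locations; replace with real values.
  val outputPath = "s3a://example-bucket/events"
  val checkpointPath = "s3a://example-bucket/checkpoints/events"

  // Explicitly typed function value, so the Scala overload of foreachBatch
  // is selected unambiguously. Uses the ordinary batch writer in append mode.
  val writeBatch: (DataFrame, Long) => Unit = (batch, batchId) => {
    batch.write.mode("append").format("parquet").save(outputPath)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-s3").getOrCreate()

    // Hypothetical broker and topic names.
    val source = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value", "timestamp")

    // The checkpoint still tracks Kafka offsets, so the query resumes
    // where it left off after a restart.
    val query = source.writeStream
      .foreachBatch(writeBatch)
      .option("checkpointLocation", checkpointPath)
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start()

    query.awaitTermination()
  }
}
```

If duplicates from retried batches matter, the batchId passed to the function can be used to deduplicate output, for example by writing each batch to its own batchId-named directory in overwrite mode.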
On Thu, Jun 18, 2020 at 8:24 AM Rachana Srivastava <rachanasrivastav@yahoo.com.invalid>
wrote:

Thanks so much for your response. I agree that using Spark Streaming is not recommended, but
I want a stable system; we cannot have a system that crashes every 5 days. As seen in the
picture below, we have nearly 47 MB of data in the metadata folder. The issue is that when the
size of the data grows to nearly 13 GB and driver memory is 5 GB, we get an OOM. I am not sure
how to add a TTL to the metadata; if I delete the metadata then I have to delete the checkpoint,
and hence lose the data.



    On Thursday, June 18, 2020, 03:23:32 AM PDT, Jacek Laskowski <jacek@japila.pl> wrote:
 
 
 Hi Rachana,
> Should I go backward and use Spark Streaming DStream based.
No. Never. It's no longer supported (and should really be removed from the codebase once
and for all - dreaming...).
Spark focuses on Spark SQL and Spark Structured Streaming as user-facing modules for batch
and streaming queries, respectively.
Please note that I'm not a PMC member or even a committer so I'm speaking for myself only
(not representing the project in an official way).
Regards,
Jacek Laskowski
----
https://about.me/JacekLaskowski
"The Internals Of" Online Books
Follow me on https://twitter.com/jaceklaskowski



On Thu, Jun 18, 2020 at 12:03 AM Rachana Srivastava <rachanasrivastav@yahoo.com.invalid>
wrote:

Structured Streaming vs. Spark Streaming (DStream)?
Which is recommended for system stability? Exactly-once is NOT the first priority. The first
priority is a STABLE system.
I need to make a decision soon. I need help. Here is the question again. Should
I go backward and use Spark Streaming DStream based. Write our own checkpoint and go from
there. At least we never encounter these metadata issues there.
Thanks,
Rachana
    On Wednesday, June 17, 2020, 02:02:20 PM PDT, Jungtaek Lim <kabhwan.opensource@gmail.com>
wrote:  
 
In case anyone prefers ASF projects, there are alternatives within the ASF as well:
alphabetically, Apache Hudi [1] and Apache Iceberg [2]. Both recently graduated
as top-level projects. (DISCLAIMER: I'm not involved in either.)
BTW, it would be nice if we made the metadata implementation of the file stream source/sink
pluggable. From what I've seen, the plugin approach has been chosen whenever
some part becomes complicated and it is arguable whether that part should be handled
in the Spark project or outside it, e.g. the checkpoint manager, state store provider, etc.
It would open up chances for the ecosystem to take on the challenge "without completely
re-writing the file stream source and sink", focusing on scalability of metadata in a
long-running query. The alternative projects described above will still provide more high-level
features and look attractive, but sometimes that may be just "using a sledgehammer to crack a nut".
1. https://hudi.apache.org/
2. https://iceberg.apache.org/


On Thu, Jun 18, 2020 at 2:34 AM Tathagata Das <tathagata.das1565@gmail.com> wrote:

Hello Rachana,
Getting exactly-once semantics on files and making it scale to a very large number of files
are very hard problems to solve. While Structured Streaming + built-in file sink solves the
exactly-once guarantee that DStreams could not, it is definitely limited in other ways (scaling
in terms of files, combining batch and streaming writes in the same place, etc). And solving
this problem requires a holistic solution that is arguably beyond the scope of the Spark project. 
There are other projects that are trying to solve this file management issue. For example,
Delta Lake (full disclosure, I am involved in it) was built to exactly solve this problem
- get exactly-once and ACID guarantees on files, but also scale to handling millions of files.
Please consider it as part of your solution. 
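
As a rough illustration of what writing a stream to Delta Lake looks like, here is a minimal sketch. It assumes the Delta Lake library is on the classpath; the paths are hypothetical, and Spark's built-in rate source stands in for the Kafka input so the example stays short. It is not a tuned or production-ready configuration.

```scala
import org.apache.spark.sql.SparkSession

object StreamToDelta {
  // Hypothetical locations; replace with real values.
  val tablePath = "s3a://example-bucket/delta/events"
  val checkpointPath = "s3a://example-bucket/checkpoints/delta-events"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stream-to-delta").getOrCreate()

    // Built-in test source that emits (timestamp, value) rows;
    // a stand-in for the real Kafka input.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()

    // Delta records each committed batch in the table's own transaction log
    // instead of the file sink's _spark_metadata folder.
    val query = events.writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", checkpointPath)
      .start(tablePath)

    query.awaitTermination()
  }
}
```

The resulting table can also be read as an ordinary batch source with spark.read.format("delta").load(tablePath), which is how a downstream batch job such as ML training would typically consume it.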



On Wed, Jun 17, 2020 at 9:50 AM Rachana Srivastava <rachanasrivastav@yahoo.com.invalid>
wrote:

I have written a simple Spark Structured Streaming app to move data from Kafka to S3. I found
that in order to support the exactly-once guarantee, Spark creates a _spark_metadata folder, which
ends up growing too large, as the streaming app is SUPPOSED TO run FOREVER. When the streaming
app runs for a long time, the metadata folder grows so big that we start getting OOM errors.
The only way to resolve the OOM is to delete the checkpoint and metadata folders and lose
VALUABLE customer data.

(Open Spark JIRAs: SPARK-24295, SPARK-29995, and SPARK-30462.)
Since Spark Streaming was NOT broken like this, is Spark Streaming a BETTER choice?

  
  