spark-user mailing list archives

From German Schiavon <gschiavonsp...@gmail.com>
Subject ForeachBatch Structured Streaming
Date Wed, 14 Oct 2020 07:10:29 GMT
Hi!

In the documentation it says:


   - By default, foreachBatch provides only at-least-once write guarantees.
   However, you can use the batchId provided to the function as way to
   deduplicate the output and get an exactly-once guarantee.


Taking the example snippet:


streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.persist()
  batchDF.write.format(...).save(...)  // location 1
  batchDF.write.format(...).save(...)  // location 2
  batchDF.unpersist()
}
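
For reference, this is roughly how I understand the batchId-based deduplication the documentation mentions. It is only a sketch under my own assumptions: the rate source, the /tmp paths, and the names writeBatch and batchOutputPath are illustrative and do not come from the documentation.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object ForeachBatchIdempotentSketch {

  // Hypothetical helper: derive a per-batch output directory from the batchId,
  // so a replayed batch targets the same location as the first attempt.
  def batchOutputPath(base: String, batchId: Long): String =
    s"$base/batch_id=$batchId"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("foreachBatch-sketch")
      .master("local[2]")
      .getOrCreate()

    // Assumption: the built-in "rate" source stands in for Kafka here.
    val streamingDF = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    // Typed as a Scala function value to avoid the Scala/Java overload
    // ambiguity of foreachBatch on Scala 2.12.
    val writeBatch: (DataFrame, Long) => Unit = (batchDF, batchId) => {
      // Overwriting a batch-specific directory makes the write idempotent:
      // if the same batchId is retried after a failure, the earlier
      // (possibly partial) output is replaced instead of appended to.
      batchDF.write
        .mode(SaveMode.Overwrite)
        .parquet(batchOutputPath("/tmp/foreachbatch-sketch/out", batchId))
    }

    val query = streamingDF.writeStream
      .foreachBatch(writeBatch)
      .option("checkpointLocation", "/tmp/foreachbatch-sketch/checkpoint")
      .start()

    query.awaitTermination(10000) // run for about 10 seconds, then shut down
    query.stop()
    spark.stop()
  }
}
```

As far as I can tell, this only deduplicates the output when a batch is re-executed. It does nothing about duplicate records that are already present in the source data.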


Let's assume I'm reading from Kafka. Does that mean that, by default, *batchDF*
may or may not contain duplicates?

Thanks!
