spark-user mailing list archives

From muru <mmur...@gmail.com>
Subject Re: ForeachBatch Structured Streaming
Date Thu, 15 Oct 2020 23:31:19 GMT
To achieve exactly-once semantics with foreachBatch in Structured Streaming,
you must have checkpointing enabled. On any exception or failure, the Spark
SS job restarts and reprocesses the same batchId (this holds for any data
source). To avoid duplicates, you need an external system in which to record
processed batchIds and deduplicate against them.
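
For example, here is a minimal sketch of that store-and-dedupe pattern. All
the specifics (broker address, topic, paths, and the in-memory
processedBatchIds set) are illustrative; a real deployment needs a durable
store that survives driver restarts:

import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.collection.mutable

val spark = SparkSession.builder().appName("foreachBatch-dedupe-sketch").getOrCreate()

// Stand-in for a durable external store of processed batch ids.
// foreachBatch runs its function on the driver, so mutating this set is
// safe, but an in-memory set is lost on driver restart; in production use
// something durable such as a database table.
val processedBatchIds = mutable.Set[Long]()

val streamingDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // illustrative
  .option("subscribe", "events")                       // illustrative
  .load()

val query = streamingDF.writeStream
  // The checkpoint is what makes a failed batch replay with the same batchId.
  .option("checkpointLocation", "/tmp/checkpoints/dedupe-sketch")
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    if (!processedBatchIds.contains(batchId)) {
      // First time this batchId is seen: do the side-effecting write,
      // then record the id so a replay of the same batch is skipped.
      batchDF.write.format("parquet").mode("append").save("/tmp/out/location1")
      processedBatchIds.add(batchId)
    }
    // else: this is a replay after a failure; skipping it here is what
    // turns at-least-once into exactly-once at the sink.
  }
  .start()

query.awaitTermination()

Note the window between the save and the add: if the job dies between them,
the replay writes the batch twice. For true exactly-once the data write and
the batchId marker should commit in one transaction, which is why the dedupe
logic has to live in the external system rather than in Spark.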

On Wed, Oct 14, 2020 at 12:11 AM German Schiavon <gschiavonspark@gmail.com>
wrote:

> Hi!
>
> In the documentation it says:
>
>
>    - By default, foreachBatch provides only at-least-once write
>    guarantees. However, you can use the batchId provided to the function as
>    way to deduplicate the output and get an exactly-once guarantee.
>
>
> Taking the example snippet :
>
>
> streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
>   batchDF.persist()
>   batchDF.write.format(...).save(...)  // location 1
>   batchDF.write.format(...).save(...)  // location 2
>   batchDF.unpersist()}
>
>
> Let's assume I'm reading from Kafka, that means that by default *batchDF *may
> or may not have duplicates?
>
> Thanks!
>
>
>
>
>
>
