spark-user mailing list archives

From Piotr Smoliński <>
Subject Re: Can Spark support exactly-once with Kafka, given the following questions?
Date Mon, 05 Dec 2016 09:08:54 GMT
The boundary is a bit flexible. In terms of the observed effective state of
the DStream, the direct stream's semantics are exactly-once. In terms of
what external systems observe (such as message emission), Spark Streaming's
semantics are at-least-once.
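
To make that boundary concrete: the usual way to get effective exactly-once
as seen by an external system is an idempotent sink. Below is a minimal
sketch, assuming the Kafka 0.10 direct stream (spark-streaming-kafka-0-10),
a broker on localhost:9092, a made-up topic name "events", and a hypothetical
upsertByKey standing in for a real keyed store. Since the (topic, partition,
offset) key is unique, a replayed batch overwrites the same entries instead
of appending duplicates.

import java.util.concurrent.ConcurrentHashMap

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object IdempotentSinkSketch {
  // Stand-in for a real idempotent store (e.g. an upsert into a keyed table).
  // On a cluster each executor holds its own copy; this is illustration only.
  val store = new ConcurrentHashMap[String, String]()
  def upsertByKey(key: String, value: String): Unit = store.put(key, value)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("idempotent-sink").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",   // assumed broker address
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "sketch-group",
      "enable.auto.commit" -> (false: java.lang.Boolean))
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        records.foreach { r =>
          // Keyed by (topic, partition, offset): a replay after a failure
          // rewrites the same key, so the external view stays exactly-once.
          upsertByKey(s"${r.topic}-${r.partition}-${r.offset}", r.value)
        }
      }
    }
    ssc.start()
    ssc.awaitTermination()
  }
}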


On Mon, Dec 5, 2016 at 8:59 AM, Michal Šenkýř <> wrote:

> Hello John,
> 1. If a task completes its operation, it notifies the driver. The driver
> may not receive the message due to the network and may think the task is
> still running. Will the child stage then never be scheduled?
> Spark's fault-tolerance policy is: if there is a problem processing a task,
> or an executor is lost, run the task (and any dependent tasks) again; a
> short demonstration follows below this quote. Spark attempts to minimize
> the number of tasks it has to recompute, so usually only a small part of
> the data is recomputed.
> So in your case, the driver simply reschedules the task on another executor
> and continues to the next stage once it receives the data.
> 2. How does Spark guarantee that the downstream task receives the shuffle
> data completely? In fact, I can't find a checksum for blocks in Spark. For
> example, the upstream task may shuffle 100 MB of data, but the downstream
> task may receive only 99 MB due to the network. Can Spark verify, based on
> size, that the data was received completely?
> Spark uses compression with checksumming for shuffle data, so it should
> know when the data is corrupt and initiate a recomputation.
> As for the question in your subject line:
> All of this means that Spark supports at-least-once processing. There is
> no way that I know of to ensure exactly-once. You can try to minimize
> more-than-once situations by updating your offsets as soon as possible
> (a sketch follows below this quote), but that does not eliminate the
> problem entirely.
> Hope this helps,
> Michal Senkyr
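
On the first answer above, the rerun behaviour is easy to observe in local
mode. A minimal sketch, not production code: the injected failure is
artificial, and local[2,4] means two worker threads with up to four task
failures tolerated before the job is failed.

import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object RetryDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("retry-demo").setMaster("local[2,4]")
    val sc = new SparkContext(conf)
    val result = sc.parallelize(1 to 10, 2).map { x =>
      // Fail the first attempt of partition 0; the scheduler reruns the
      // task, and the retry (attemptNumber > 0) succeeds.
      val ctx = TaskContext.get()
      if (ctx.attemptNumber() == 0 && ctx.partitionId() == 0)
        throw new RuntimeException("simulated failure on first attempt")
      x * 2
    }.collect()
    println(result.sorted.mkString(","))  // all ten values despite the failure
    sc.stop()
  }
}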
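
And on updating offsets as soon as possible: with the 0.10 direct stream you
can commit offsets back to Kafka right after a batch's output succeeds,
which narrows the window in which a failure causes re-delivery. A sketch of
the HasOffsetRanges/CanCommitOffsets pattern from the Kafka integration
guide, reusing the stream from the earlier example; the write step is elided.

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // The exact Kafka offset ranges this batch covers.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... write the batch to the external system here ...
  // Commit only afterwards: a crash before this line replays the batch
  // (at-least-once) rather than silently dropping it.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}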
