spark-user mailing list archives

From: Michal Šenkýř <bina...@gmail.com>
Subject: Re: Can spark support exactly once based kafka ? Due to these following question?
Date: Mon, 05 Dec 2016 07:59:36 GMT
Hello John,

> 1. If a task completes its operation, it notifies the driver. The driver 
> may not receive the message due to the network and will think the task 
> is still running. Then the child stage won't be scheduled?

Spark's fault-tolerance policy is: if there is a problem processing a 
task, or an executor is lost, run the task (and any dependent tasks) 
again. Spark attempts to minimize the number of tasks it has to 
recompute, so usually only a small part of the data is recomputed.

So in your case, the driver simply reschedules the task on another 
executor and only continues to the next stage once it has received the 
result.
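For what it's worth, the number of attempts the driver makes per task 
before giving up on the whole job is controlled by spark.task.maxFailures 
(4 by default). A minimal sketch of raising it when building the context 
(the app name and value are just illustrations):

    import org.apache.spark.{SparkConf, SparkContext}

    // Allow each task up to 8 attempts before the job is aborted
    // (the default is 4); tune this for your workload.
    val conf = new SparkConf()
      .setAppName("retry-example") // illustrative name
      .set("spark.task.maxFailures", "8")
    val sc = new SparkContext(conf)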

> 2. How does Spark guarantee that the downstream task receives the 
> shuffle data completely? In fact, I can't find a checksum for blocks 
> in Spark. For example, the upstream task may shuffle 100 MB of data, 
> but the downstream task may receive only 99 MB due to the network. 
> Can Spark verify that the data was received completely based on size?

Spark uses compression with checksumming for shuffle data, so it should 
detect corrupted data and initiate a recomputation.
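The relevant settings, if you want to be explicit, are 
spark.shuffle.compress and spark.io.compression.codec. A sketch (these 
values are the defaults in recent versions, so you normally don't need 
to set them at all):

    import org.apache.spark.SparkConf

    // Shuffle blocks are compressed by default; the codec's block framing
    // and checksums should let Spark notice truncated or corrupted blocks
    // when they are read back.
    val conf = new SparkConf()
      .set("spark.shuffle.compress", "true")    // default: true
      .set("spark.io.compression.codec", "lz4") // default codec in 2.x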

As for your question in the subject:
All of this means that Spark supports at-least-once processing. There is 
no way that I know of to ensure exactly-once. You can try to minimize 
more-than-once situations by committing your Kafka offsets as soon as 
possible, but that does not eliminate the problem entirely.

Hope this helps,

Michal Senkyr

