spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Can spark support exactly once based kafka ? Due to these following question?
Date Mon, 05 Dec 2016 10:17:19 GMT
You need to do the bookkeeping of what has been processed yourself. This may mean roughly
the following (of course, the devil is in the details):
Record in ZooKeeper which part of the processing job has been done and for which dataset
all the output has been created (do not keep the data itself in ZooKeeper).
When a processing job starts, check in ZooKeeper whether it has already been processed: if
not, remove any leftover staging data and process it; if it has, terminate.
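
A minimal sketch of that check-then-process pattern, assuming Apache Curator for the ZooKeeper
access; the znode layout, the batch id, and the cleanStagingData/processBatch hooks are
placeholders for illustration, not an existing API:

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

object BatchBookkeeping {
  // Connection string is a placeholder for your ZooKeeper ensemble.
  private val client = CuratorFrameworkFactory.newClient(
    "zk-host:2181", new ExponentialBackoffRetry(1000, 3))
  client.start()

  // One marker znode per batch/dataset; only the marker lives in ZooKeeper, never the data.
  private def markerPath(batchId: String): String = s"/myapp/processed/$batchId"

  def runOnce(batchId: String)(cleanStagingData: () => Unit)(processBatch: () => Unit): Unit = {
    if (client.checkExists().forPath(markerPath(batchId)) != null) {
      // Already marked as fully processed: terminate without reprocessing.
      return
    }
    // Not marked done: a previous attempt may have failed part-way,
    // so remove any partial staging output before reprocessing.
    cleanStagingData()
    processBatch()
    // Record completion only after all output for this batch exists.
    client.create().creatingParentsIfNeeded().forPath(markerPath(batchId))
  }
}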

As I said, the details depend on your job and require some careful thinking, but exactly-once
can be achieved with Spark (and potentially ZooKeeper or something similar, such as Redis).
At the same time, consider whether you also need in-order delivery, etc.

> On 5 Dec 2016, at 08:59, Michal Šenkýř <binarek@gmail.com> wrote:
> 
> Hello John,
> 
>> 1. If a task completes its operation, it will notify the driver. The driver may not
>> receive the message due to the network and think the task is still running. Then the
>> child stage won't be scheduled?
> Spark's fault-tolerance policy is: if there is a problem processing a task or an executor
> is lost, run the task (and any dependent tasks) again. Spark attempts to minimize the number
> of tasks it has to recompute, so usually only a small part of the data is recomputed.
> 
> So in your case, the driver simply schedules the task on another executor and continues
> to the next stage when it receives the data.
>> 2. How does Spark guarantee that the downstream task receives the shuffle data completely?
>> In fact, I can't find a checksum for blocks in Spark. For example, the upstream task may
>> shuffle 100 MB of data, but the downstream task may receive only 99 MB due to the network.
>> Can Spark verify that the data was received completely based on its size?
> Spark uses compression with checksumming for shuffle data, so it should know when the data
> is corrupt and initiate a recomputation.
> 
> As for your question in the subject:
> All of this means that Spark supports at-least-once processing. There is no way that
> I know of to ensure exactly-once. You can try to minimize more-than-once situations by
> updating your offsets as soon as possible, but that does not eliminate the problem entirely.
> 
> Hope this helps,
> Michal Senkyr
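
For the offset handling mentioned in the quoted reply, a minimal sketch using the
spark-streaming-kafka-0-10 direct stream; it assumes ssc is an existing StreamingContext,
and the broker address, group id and topic name are placeholders. Committing the offsets
right after each batch's output succeeds keeps the replay window to roughly one batch,
but as noted above this is still at-least-once, not exactly-once:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "kafka-host:9092",           // placeholder broker address
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-consumer-group",                   // placeholder group id
  "enable.auto.commit" -> (false: java.lang.Boolean)   // commit manually, below
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... write the batch's results here, ideally idempotently ...
  // Commit only after the output succeeded; a failure before this point
  // means the batch is replayed on restart, never silently skipped.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}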
