spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jörn Franke <>
Subject Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure
Date Wed, 22 Jun 2016 09:58:26 GMT

Spark Streamig does not guarantee exactly once for output action. It means that one item is
only processed in an RDD.
You can achieve at most once or at least once.
You could however do at least once (via checkpoing) and record which messages have been proceed
(some identifier available?) and do not re process them .... You could also store (safely)
what range has been already processed etc

Think about the business case if exactly once is needed or if it can be replaced by one of
the others.
Exactly once, it needed requires in any system including spark more effort and usually the
throughput is lower. A risk evaluation from a business point of view has to be done anyway...

> On 22 Jun 2016, at 09:09, sandesh deshmane <> wrote:
> Hi,
> I am writing spark streaming application which reads messages from Kafka.
> I am using checkpointing and write ahead logs ( WAL) to achieve fault tolerance .
> I have created batch size of 10 sec for reading messages from kafka.
> I read messages for kakfa and generate the count of messages as per values received from
Kafka message.
> In case there is failure and my spark streaming application is restarted I see duplicate
messages processed ( which is close to 2 batches)
> The problem that I have is per sec I get around 300k messages and In case application
is restarted I see around 3-5 million duplicate counts.
> How to avoid such duplicates?
> what is best to way to recover from such failures ?
> Thanks
> Sandesh

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message