spark-user mailing list archives

From sandesh deshmane <>
Subject how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure
Date Wed, 22 Jun 2016 07:09:45 GMT

I am writing a Spark Streaming application which reads messages from Kafka.

I am using checkpointing and write-ahead logs (WAL) to achieve fault tolerance.
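
For reference, the WAL for receiver-based input streams is enabled through a Spark property; a minimal sketch of the relevant configuration (the checkpoint directory itself is set in code via `StreamingContext.checkpoint`):

```
# spark-defaults.conf (or set on SparkConf): enable the write-ahead log
# for receiver-based input streams such as the Kafka receiver
spark.streaming.receiver.writeAheadLog.enable  true
```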

I have set a batch interval of 10 seconds for reading messages from Kafka.

I read messages from Kafka and generate a count of messages based on the values
received in each Kafka message.

When there is a failure and my Spark Streaming application is restarted, I
see duplicate messages processed (close to two batches' worth).

The problem is that I receive around 300k messages per second, so when the
application is restarted I see around 3-5 million duplicate counts.

How can I avoid such duplicates?

What is the best way to recover from such failures?
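
For context, the restart scenario above is harmless if the counting itself is idempotent, i.e. each message contributes to the count at most once even when a batch is replayed. A minimal Spark-free sketch of that idea, assuming every Kafka message carries a unique id (all names here are hypothetical):

```python
# Idempotent counter: replaying a batch after a restart does not inflate
# the counts, because each message id is counted only once.
class IdempotentCounter:
    def __init__(self):
        self.seen = set()   # ids already counted (in practice, an external store)
        self.counts = {}    # message value -> count

    def process(self, msg_id, value):
        if msg_id in self.seen:
            return          # duplicate delivery after a restart: skip it
        self.seen.add(msg_id)
        self.counts[value] = self.counts.get(value, 0) + 1

counter = IdempotentCounter()
batch = [(1, "a"), (2, "b"), (3, "a")]
for msg_id, value in batch:
    counter.process(msg_id, value)
# Simulate a restart that replays the same batch.
for msg_id, value in batch:
    counter.process(msg_id, value)
# Counts are unchanged by the replay: {"a": 2, "b": 1}
```

In a real Spark job the `seen` state would have to live in an external store that survives restarts (or the offsets would be committed atomically with the counts); this only illustrates the idempotency principle, not a Spark API.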

