spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure
Date Wed, 22 Jun 2016 15:59:18 GMT

That is the cost of exactly once :) 

> On 22 Jun 2016, at 12:54, sandesh deshmane <sandesh.vjti@gmail.com> wrote:
> 
> We are going with checkpointing. We don't have an identifier available to tell whether a message has already been processed.
> Even if we had one, the lookup would slow down processing, since we get around 300k messages per second.
> 
> Thanks
> Sandesh
> 
>> On Wed, Jun 22, 2016 at 3:28 PM, Jörn Franke <jornfranke@gmail.com> wrote:
>> 
>> Spark Streaming does not guarantee exactly once for output actions; exactly once would mean that each item is processed only once.
>> You can achieve at most once or at least once.
>> You could however do at least once (via checkpointing) and record which messages have been processed (is some identifier available?) so that they are not reprocessed... You could also safely store which range has already been processed, etc.
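>> 
>> A rough sketch of that idea, assuming the direct Kafka API (spark-streaming-kafka for Kafka 0.8), which exposes the offset range of each batch. OffsetStore, the broker address, and the topic name below are placeholders for whatever transactional store you have:
>> 
>> import kafka.serializer.StringDecoder
>> import org.apache.spark.streaming.StreamingContext
>> import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}
>> 
>> // Placeholder for a transactional store (JDBC, ZooKeeper, HBase, ...)
>> // that remembers which offset ranges have already been written.
>> trait OffsetStore {
>>   def alreadyProcessed(range: OffsetRange): Boolean
>>   def commit(counts: Map[String, Long], ranges: Array[OffsetRange]): Unit
>> }
>> 
>> def run(ssc: StreamingContext, store: OffsetStore): Unit = {
>>   val kafkaParams = Map("metadata.broker.list" -> "broker:9092") // placeholder
>>   val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
>>     ssc, kafkaParams, Set("events")) // placeholder topic
>> 
>>   stream.foreachRDD { rdd =>
>>     // The direct stream knows exactly which offsets this batch covers
>>     val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>>     // A batch replayed after a restart carries ranges that were already committed
>>     if (!ranges.forall(store.alreadyProcessed)) {
>>       val counts = rdd.map(_._2).countByValue().toMap
>>       // Write the counts and the offset ranges in one transaction, so the
>>       // output and the "already processed" marker can never diverge
>>       store.commit(counts, ranges)
>>     }
>>   }
>> }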
>> 
>> Think about the business case: is exactly once really needed, or can it be replaced by one of the others?
>> Exactly once, if needed, requires more effort in any system, including Spark, and usually the throughput is lower. A risk evaluation from a business point of view has to be done anyway...
>> 
>> > On 22 Jun 2016, at 09:09, sandesh deshmane <sandesh.vjti@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I am writing spark streaming application which reads messages from Kafka.
>> >
>> > I am using checkpointing and write-ahead logs (WAL) to achieve fault tolerance.
>> >
>> > I use a batch interval of 10 seconds for reading messages from Kafka.
>> >
>> > I read the messages from Kafka and generate counts of messages per value received in the Kafka messages.
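>> >
>> > Roughly, my setup looks like the sketch below (the ZooKeeper address, topic, and checkpoint path are placeholders):
>> >
>> > import org.apache.spark.SparkConf
>> > import org.apache.spark.storage.StorageLevel
>> > import org.apache.spark.streaming.kafka.KafkaUtils
>> > import org.apache.spark.streaming.{Seconds, StreamingContext}
>> >
>> > object CountingApp {
>> >   val checkpointDir = "hdfs:///checkpoints/counting-app" // placeholder path
>> >
>> >   def createContext(): StreamingContext = {
>> >     val conf = new SparkConf()
>> >       .setAppName("kafka-message-counts")
>> >       // enable the WAL so received data survives a driver failure
>> >       .set("spark.streaming.receiver.writeAheadLog.enable", "true")
>> >     val ssc = new StreamingContext(conf, Seconds(10)) // 10-second batches
>> >     ssc.checkpoint(checkpointDir)
>> >
>> >     val messages = KafkaUtils.createStream(
>> >       ssc, "zk-host:2181", "counter-group", Map("events" -> 1),
>> >       StorageLevel.MEMORY_AND_DISK_SER) // serialized storage, as recommended with the WAL
>> >
>> >     messages.map(_._2).countByValue().print() // count per distinct value
>> >     ssc
>> >   }
>> >
>> >   def main(args: Array[String]): Unit = {
>> >     // on restart, rebuild the context from the checkpoint instead of creating a new one
>> >     val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
>> >     ssc.start()
>> >     ssc.awaitTermination()
>> >   }
>> > }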
>> >
>> > When there is a failure and my Spark Streaming application is restarted, I see duplicate messages processed (close to 2 batches' worth).
>> >
>> > The problem I have is that I get around 300k messages per second, so when the application is restarted I see around 3-5 million duplicate counts.
>> >
>> > How can I avoid such duplicates?
>> >
>> > What is the best way to recover from such failures?
>> >
>> > Thanks
>> > Sandesh
> 
