samza-dev mailing list archives

From Michael Sklyar <mikesk...@gmail.com>
Subject Re: Data consistency and check-pointing
Date Tue, 19 Jan 2016 11:11:30 GMT
Hi Yi!
I just figured it out yesterday and was about to send an update:)

Yes, it covers our use case perfectly.

Thanks,
Michael

On Tue, Jan 19, 2016 at 5:15 AM, Yi Pan <nickpan47@gmail.com> wrote:

> Hi, Michael,
>
> Your use case sounds much like "customized checkpointing" to me. We have
> similar cases at LinkedIn, and the following is the solution we run in production:
> 1) disable Samza's automatic checkpointing by setting task.commit.ms to -1
> 2) explicitly call TaskCoordinator.commit() in sync with closing the
> transaction batch
>
> The above procedure works well and gives the user the ability to control the
> checkpoint commit together with your transaction batch. In case of a system
> crash between the closing of the transaction batch and the checkpoint commit
> (I am assuming this sequence of actions), we would follow at-least-once
> semantics and replay the messages from the last commit.
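>
> A minimal sketch of that pattern (StreamTask, TaskCoordinator.commit() and
> task.commit.ms are real Samza pieces; the HiveBatchWriter interface here is
> just a hypothetical stand-in for your TransactionBatch handling):
>
>   // Job config: task.commit.ms=-1 disables automatic checkpointing, so the
>   // only checkpoints are the ones the task requests explicitly.
>   import org.apache.samza.system.IncomingMessageEnvelope;
>   import org.apache.samza.task.MessageCollector;
>   import org.apache.samza.task.StreamTask;
>   import org.apache.samza.task.TaskCoordinator;
>
>   public class HiveStreamingTask implements StreamTask {
>     // Hypothetical wrapper around the Hive Streaming TransactionBatch;
>     // in a real job it would be opened during task initialization.
>     interface HiveBatchWriter {
>       void write(byte[] record) throws Exception;
>       void commitTransactionBatch() throws Exception;
>     }
>
>     private static final int MAX_BATCH_SIZE = 20000;
>     private HiveBatchWriter hive;      // assumed to be initialized elsewhere
>     private int messagesInBatch = 0;
>
>     @Override
>     public void process(IncomingMessageEnvelope envelope,
>                         MessageCollector collector,
>                         TaskCoordinator coordinator) throws Exception {
>       hive.write((byte[]) envelope.getMessage());
>       if (++messagesInBatch >= MAX_BATCH_SIZE) {
>         hive.commitTransactionBatch();  // close the Hive transaction batch first...
>         // ...then ask Samza to checkpoint this task's offsets.
>         coordinator.commit(TaskCoordinator.RequestScope.CURRENT_TASK);
>         messagesInBatch = 0;
>       }
>     }
>   }
>
> If the process dies between commitTransactionBatch() and coordinator.commit(),
> the whole batch is replayed from the last checkpoint, which is the
> at-least-once behavior described above.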
>
> Please let us know whether that satisfies your use case.
>
> Thanks!
>
> -Yi
>
> On Sun, Jan 17, 2016 at 11:09 AM, Michael Sklyar <mikeskali@gmail.com>
> wrote:
>
> > Hi,
> >
> > We have a Samza job reading messages from Kafka and inserting them into Hive
> > via the Hive Streaming API. With Hive Streaming we are using a
> > "TransactionBatch"; closing the transaction batch closes the file on HDFS.
> > We close the transaction batch after reaching either a) the maximum number
> > of messages per transaction batch or b) a time threshold (for example, every
> > 20K messages or every 10 seconds).
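> >
> > A rough sketch of that flush check (the names are illustrative, not our
> > actual code):
> >
> >   // Close the TransactionBatch when either threshold is reached.
> >   boolean shouldCloseBatch(int messagesInBatch, long batchOpenedAtMs) {
> >     final int maxMessages = 20000;  // a) maximum messages per transaction batch
> >     final long maxAgeMs = 10000L;   // b) time threshold (10 seconds)
> >     return messagesInBatch >= maxMessages
> >         || System.currentTimeMillis() - batchOpenedAtMs >= maxAgeMs;
> >   }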
> >
> > It works well, but if the job terminates in the middle of a transaction
> > batch we will have data inconsistency in Hive, either:
> >
> > 1. Duplication: data that was already inserted into Hive will be processed
> > again (since the checkpoint was taken earlier than the latest message
> > written to Hive).
> >
> > 2. Missing Data: messages that were not yet committed to Hive will not be
> > reprocessed (since the checkpoint was written after they were consumed from
> > Kafka but before they were written to Hive).
> >
> > What would be the recommended method of synchronizing Hive/HDFS insertion
> > with Samza checkpointing? I am thinking of overriding
> > *KafkaCheckpointManager* & *KafkaCheckpointManagerFactory* to synchronize
> > checkpointing with committing the data to Hive. Is that a good idea?
> >
> > Thanks in advance,
> > Michael Sklyar
> >
>
