samza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yi Pan <nickpa...@gmail.com>
Subject Re: Data consistency and check-pointing
Date Tue, 19 Jan 2016 03:15:05 GMT
Hi, Michael,

Your use case sounds much like a "customized checkpointing" to me. We have
similar cases in LinkedIn and the following are the solution in production:
1) disable Samza auto-checkpoint by setting the commit_ms to -1
2) explicitly calling TaskCoordinator.commit() in sync with closing the
transaction batch

The above procedure works well and gives user to ability to control the
commit of checkpoint together w/ your transaction batch. In case of system
crash between the closing of transaction batch and the checkpoint commit (I
am assuming this sequence of actions), we would follow the at-least-once
semantics and re-play the messages from the last commit.

Please let us know whether that satisfies your use case.

Thanks!

-Yi

On Sun, Jan 17, 2016 at 11:09 AM, Michael Sklyar <mikeskali@gmail.com>
wrote:

> Hi,
>
> We have a Samza job reading messages from Kafka and inserting to hive via
> the Hive Streaming API. With Hive Streaming we are using
> "TransactionBatch", closing the Transaction batch closes the file on HDFS.
> We close the transaction batch after reaching the a. Maximum messages per
> transaction batch or b. time threshold (for example - every 20K messages or
> every 10 seconds).
>
> It works well, but in cases the job will terminate in the middle of a
> transaction batch we will have data inconsistency in hive, either:
>
> 1. Duplication: Data that was already inserted to hive will be processed
> again (since the checkpoint was taken earlier than the latest message
> written to hive).
>
>
>
>
>
>
>
>
>
> 2. Missing Data: Messages that were not committed to hive yet will not be
> reprocessed (since the checkpoint was written after
>
> What would be the recommended method of synchronizing hive/hdfs insertion
> with Samza checkpointing? I am thinking of overriding the
> *KafkaCheckpointManager* & *KafkaCheckpointManagerFactory* and
> synchronize check-pointing with
> committing the data to hive. Is it a good idea?
>
> Thanks in advance,
> Michael Sklyar
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message