samza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kishore N C <>
Subject Detecting "done" on a bounded input dataset
Date Wed, 14 Oct 2015 16:33:45 GMT

Our data processing pipeline consists of a set of Samza jobs, that form a
DAG. Sometimes, we have to throw finite datasets into the Kafka topic that
acts as the entry point to the pipeline. Given that different Samza jobs in
the DAG could have varying latencies in terms of processing the records (or
could even temporarily fails or be stuck), how do I detect that my assembly
of jobs have finished processing all records? It's not as simple as
tallying the input and output record counts, as some jobs could be
filtering data, and others could be grouping records etc.



  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message