kafka-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Pierre DOR <p...@amadeus.com>
Subject Detecting end of distributed batch processing within a streaming graph
Date Thu, 01 Mar 2018 19:03:36 GMT
Hello all,

This question is not strictly related to Kafka but rather to a streaming design using Kafka.
Hope it still stays within the scope of this list.

I would like to distribute the processing of a monolithic batch within a streaming DAG, so
with multiple parallel branches, each branch being composed of one or several streaming micro-services
doing tasks in sequence, and communication between each micro-service being done through Kafka.
The problem is to detect instantaneously that all subtasks have been processed (successfully
or not) within the DAG, thing that is rather trivial in the monolithic design.

To solve this I’m thinking about a special “control topic”, responsible for carrying
end processing statuses, and a specific micro-service responsible for handling those processing
statuses and for eventually reporting proactively the batch end event.
Knowing that there will be process, let’s call it “the feeder”, injecting all batch
subtasks in the primary input topic of the DAG, the idea in a nutshell would be:
- To have the feeder generating first a START <batchID> event in the control topic,
so that the “controller” micro-service initializes a state for the given batchID.
- To have the feeder injecting in sequence all subtasks in the primary input topic.
- To have the feeder injecting right after a special END <batchID> event in the primary
input topic. This END event is special because instead of being sent over Kafka as any other
message, i.e. on one partition, this message would be broadcast on every topic’s partitions,
and cascaded within the DAG in the same way so that it is sure that it will go through absolutely
all partitions (and all consumers) of the global streaming graph.
- Each consumer micro-service receiving this END message (so possibly multiple times) will
report to the “control topic” that this END message has been received, together with some
IDs (batch ID and partition ID).
- The “controller” micro-service updates a counter each time it receives an END event
for a given batch ID. As the total number of partitions in the DAG is relatively static, this
number can be pre-configured in the controller. When the total number of received END events
matches the total number of partitions that means that the batch has been processed.

What do you think about this design?
Would you have more elegant ideas to solve this problem? I don’t like very much the idea
to broadcast messages on all partitions, nor to couple the configuration with the total number
of partitions, even if technically it seems to work.

Many thanks,
- Pierre

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message