Hi All,

 

We are facing a peculiar issue in our application, which uses Apache Storm for processing and Microsoft Event Hub as the message source.

Background information:

We have an Event Hub with 4 partitions and use the Microsoft-provided EventHubSpout in our Storm topology.

We are currently doing a historical load, so the Event Hub receives 50K messages per hour and Storm needs to process them.

We scaled our Storm cluster to 18 nodes (4 cores each) to support this requirement.

                We have 68 workers for the topology in which we are seeing this issue.

                We keep the topology name the same after every restart.

 

I have the questions below; please clarify them.

1.       Is it true that we need to match the number of workers to the number of partitions in the Event Hub?

For example, in our case the number of workers would be 4, since our Event Hub has 4 partitions.

If so, what issues will it cause if the number of workers is greater than the number of Event Hub partitions?

Would it cause the duplication issue we are facing now?

We believe we are acking the tuples in our bolts correctly, because Storm does not pick up an old message again once it has processed it.

It re-processes the old messages only after the topology is restarted.
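
To make question 1 concrete, here is a minimal sketch of how we imagine the 4 partitions would be spread across the spout tasks. The round-robin rule, the class name PartitionAssignment and the figure of 6 spout tasks are our own assumptions for illustration; this is not taken from the EventHubSpout source or documentation. Note that it counts spout tasks, which may not be the same as the number of workers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the EventHubSpout implementation.
class PartitionAssignment {

    // Assumed rule: spout task i reads partitions i, i + totalTasks, i + 2 * totalTasks, ...
    static List<Integer> partitionsFor(int taskIndex, int totalTasks, int partitionCount) {
        List<Integer> owned = new ArrayList<>();
        for (int p = taskIndex; p < partitionCount; p += totalTasks) {
            owned.add(p);
        }
        return owned;
    }

    public static void main(String[] args) {
        int partitionCount = 4; // our Event Hub has 4 partitions
        int totalTasks = 6;     // hypothetical spout parallelism, larger than the partition count

        // Print which partitions each spout task would own under the assumed rule.
        for (int task = 0; task < totalTasks; task++) {
            System.out.println("spout task " + task + " -> partitions "
                    + partitionsFor(task, totalTasks, partitionCount));
        }
    }
}
```

Under this assumed rule, any spout task whose index is 4 or higher gets no partition at all, so we would expect extra tasks to sit idle rather than read the same messages twice. Please correct us if the real behaviour is different.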


2.       Is it true that we get better processing power if the number of workers is equal to the number of supervisor nodes?

For example, we have 18 nodes, which means we could have 18 workers.

Currently, we keep 1 worker per slot.

For example, we have 18 nodes (4 slots each), so we have 72 workers.
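
The arithmetic behind the two layouts in question 2 can be sketched as follows. The class name WorkerLayout and the assumption that workers are spread evenly across the nodes are ours, for illustration only; the node and slot counts are the ones given in this post.

```java
// Hypothetical illustration of the two layouts we are comparing.
class WorkerLayout {

    // Total worker slots available in the cluster.
    static int totalSlots(int nodes, int slotsPerNode) {
        return nodes * slotsPerNode;
    }

    // Largest number of workers on any one node, assuming an even spread (rounded up).
    static int workersPerNode(int workers, int nodes) {
        return (workers + nodes - 1) / nodes;
    }

    public static void main(String[] args) {
        int nodes = 18;
        int slotsPerNode = 4;

        int onePerSlot = totalSlots(nodes, slotsPerNode); // current setup: 1 worker per slot
        int onePerNode = nodes;                           // alternative: 1 worker per node

        System.out.println("1 worker per slot: " + onePerSlot + " workers, "
                + workersPerNode(onePerSlot, nodes) + " per node");
        System.out.println("1 worker per node: " + onePerNode + " workers, "
                + workersPerNode(onePerNode, nodes) + " per node");
    }
}
```

With 4 cores per node, the first layout puts roughly one worker process on each core and the second puts one worker process on each node. We would like to know which of the two is recommended.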

--
Thank You
Hari Hara Sudhan R