storm-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alberto São Marcos <alberto....@gmail.com>
Subject Re: How to guarantee exactly-once on Kafka messages processing
Date Sat, 11 Jun 2016 13:22:53 GMT
Like i said Francisco, as I see it idempotent workflows are easier to
achieve. The amount of coordination for *trying* to do exactly once is just
not worth it.
On Jun 9, 2016 23:18, "Francisco Lopes" <chico.lopes@gmail.com> wrote:

> Alberto,
>
> But isn't that exactly what Trident is supposed to do?
>
> http://storm.apache.org/releases/current/Trident-state.html
>
> After reading more about this subject, it seems Trident gives me the
> exactly once semantics for free:
>
> You've seen the intricacies of what it takes to achieve exactly-once
>> semantics. The nice thing about Trident is that it internalizes all the
>> fault-tolerance logic within the State – as a user you don't have to deal
>> with comparing txids, storing multiple values in the database, or anything
>> like that.
>>
>
> From that I understand that Trident will take care of it all for me,
> storing just the txid somewhere for each tuple, as described as an example
> here:
>
> You've already seen that storing just the count as the value isn't
>> sufficient to know whether you've processed a batch of tuples before.
>> Instead, what you can do is store the transaction id with the count in the
>> database as an atomic value. Then, when updating the count, you can just
>> compare the transaction id in the database with the transaction id for the
>> current batch. If they're the same, you skip the update – because of the
>> strong ordering, you know for sure that the value in the database
>> incorporates the current batch. If they're different, you increment the
>> count. This logic works because the batch for a txid never changes, and
>> Trident ensures that state updates are ordered among batches.
>>
>
> I'm wondering though where does Trident these internal txids. Would it be
> in memory? Or Zookeeper, maybe?
>
> Regards,
> Francisco
>
>
> On Thu, Jun 9, 2016 at 6:23 AM, Alberto São Marcos <alberto.csm@gmail.com>
> wrote:
>
>> Trying to do "exactly once"  in distributed systems is not the easiest
>> or safest path Francisco. Making the downstream workflow idempotent is
>> usually a lot easier and prolly a much robust solution.
>>
>> On Wed, Jun 8, 2016 at 3:02 PM, Francisco Lopes <chico.lopes@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I'm new to Storm/Trident and I'm using it to read messages from Kafka
>>> and send them to an external API exactly once. My topology is as simple as:
>>>
>>> IPartitionedTridentSpout kafkaSpout = getKafkaSpout();
>>> TridentTopology topology = new TridentTopology();
>>>
>>> topology
>>>     .newStream("kafka", kafkaSpout)
>>>     .each(new Fields("str"), new Processor(), new Fields());
>>>
>>> I'm not sure how I should implement state to guarantee a message is not
>>> processed twice.
>>>
>>> Can anyone please enlighten me?
>>>
>>> All the examples I found show states as counts or sums and that's not
>>> what I really need. I'm inclined to use a Redis instance to store state (
>>> https://github.com/kstyrc/trident-redis), but I don't know on what I'd
>>> actually persistentAggregate.
>>>
>>> Thank you.
>>>
>>> Regards,
>>> Francisco
>>>
>>
>>
>

Mime
View raw message