storm-user mailing list archives

From: Ted Dunning <ted.dunn...@gmail.com>
Subject: Re: Topology is stuck
Date: Wed, 09 Apr 2014 22:03:53 GMT
What exactly do you mean when you say that reads in ZK are eventually
consistent?

You may get a slightly old value, but you are guaranteed to see a
consistent history.  That is, if a znode's value goes through versions
v_1 ... v_n, then once you have seen v_i you will never see a v_j where j < i.

You can also guarantee that you don't see stale values at all by calling sync before you read.

Normally when people say "eventually consistent" they mean that two
participants can see inconsistent histories under partition.  That isn't
possible in ZK.  As I understand it, ZK would be better described as
providing sequential consistency since all observers will see all updates
in the same order.
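
For what it's worth, a minimal sketch of that sync-then-read pattern with
the plain ZooKeeper Java client (the latch-based wait is illustrative;
connection setup and error-code handling are omitted):

    import org.apache.zookeeper.AsyncCallback;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;
    import java.util.concurrent.CountDownLatch;

    public class SyncedRead {
        // Ask the server to catch up with the leader before reading, so the
        // value returned is at least as fresh as the moment sync was issued.
        public static byte[] read(final ZooKeeper zk, String path)
                throws KeeperException, InterruptedException {
            final CountDownLatch synced = new CountDownLatch(1);
            zk.sync(path, new AsyncCallback.VoidCallback() {
                public void processResult(int rc, String p, Object ctx) {
                    synced.countDown();
                }
            }, null);
            synced.await();
            return zk.getData(path, false, new Stat());
        }
    }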




On Wed, Apr 9, 2014 at 2:50 PM, Jason Jackson <jasonjckn@gmail.com> wrote:

> I have one theory: because reads in ZooKeeper are eventually consistent,
> that may be a necessary condition for the bug to manifest. One way to test
> this hypothesis is to run a ZooKeeper ensemble with 1 node, or an ensemble
> configured for 5 nodes with 2 of them taken offline, so that every write
> operation only succeeds if every live member of the ensemble sees it. This
> should produce strongly consistent reads. If you run this test, let me
> know what the results are. (Clearly this isn't a good production setup,
> as you're trading availability for greater consistency, but the results
> could help narrow down the bug.)
>
>
> On Wed, Apr 9, 2014 at 2:43 PM, Jason Jackson <jasonjckn@gmail.com> wrote:
>
>> Yeah, it's probably a bug in Trident. It would be amazing if someone
>> figured out the fix for this. I spent about 6 hours looking into it, but
>> couldn't figure out why it was occurring.
>>
>> Beyond fixing this, one thing you could do to buy yourself time is to
>> disable batch retries in Trident. There's no option for this in the API,
>> but it's a 1 or 2 line change to the code. Obviously you lose
>> exactly-once semantics, but at least you would have a system that never
>> falls behind real time.
>>
>>
>>
>> On Wed, Apr 9, 2014 at 1:10 AM, Danijel Schiavuzzi <
>> danijel@schiavuzzi.com> wrote:
>>
>>> Thanks Jason. However, I don't think that was the case in my stuck
>>> topology, otherwise I'd have seen exceptions (thrown by my Trident
>>> functions) in the worker logs.
>>>
>>>
>>> On Wed, Apr 9, 2014 at 3:02 AM, Jason Jackson <jasonjckn@gmail.com> wrote:
>>>
>>>> An example of "corrupted input" that causes a batch to fail: you
>>>> expect the data you read off Kafka (or some other queue) to conform to
>>>> a schema, but for whatever reason it doesn't, and the function you pass
>>>> to stream.each throws an exception when it hits the unexpected data.
>>>> The batch is then retried, but since it fails deterministically, it
>>>> will be retried forever.
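
For concreteness, a sketch of that failure mode as a Trident function; the
comma-separated record format and field positions are made up for
illustration:

    import storm.trident.operation.BaseFunction;
    import storm.trident.operation.TridentCollector;
    import storm.trident.tuple.TridentTuple;
    import backtype.storm.tuple.Values;

    public class ParseRecord extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            String raw = tuple.getString(0);
            String[] parts = raw.split(",");
            if (parts.length != 2) {
                // Throwing here fails the whole batch. A malformed record
                // stays malformed on every retry, so the batch loops forever:
                //     throw new RuntimeException("bad record: " + raw);
                // Dropping the tuple (or routing it to an error stream)
                // instead lets the batch succeed.
                return;
            }
            // Note: a non-numeric count would still throw here; validate
            // as strictly as your schema requires.
            collector.emit(new Values(parts[0], Long.parseLong(parts[1])));
        }
    }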
>>>>
>>>>
>>>> On Mon, Apr 7, 2014 at 10:37 AM, Danijel Schiavuzzi <
>>>> danijel@schiavuzzi.com> wrote:
>>>>
>>>>> Hi Jason,
>>>>>
>>>>> Could you be more specific -- what do you mean by "corrupted input"?
>>>>> Do you mean that there's a bug in Trident itself that causes the tuples
>>>>> in a batch to somehow become corrupted?
>>>>>
>>>>> Thanks a lot!
>>>>>
>>>>> Danijel
>>>>>
>>>>>
>>>>> On Monday, April 7, 2014, Jason Jackson <jasonjckn@gmail.com> wrote:
>>>>>
>>>>>> This could happen if you have corrupted input that always causes a
>>>>>> batch to fail and be retried.
>>>>>>
>>>>>> I have seen this behaviour before and I didn't see corrupted input.
>>>>>> It might be a bug in Trident, I'm not sure. If you figure it out,
>>>>>> please update this thread and/or submit a patch.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 31, 2014 at 7:39 AM, Danijel Schiavuzzi <
>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>
>>>>>> To (partially) answer my own question -- I still have no idea on the
>>>>>> cause of the stuck topology, but re-submitting the topology helps --
>>>>>> after re-submitting, my topology is now running normally.
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <
>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>
>>>>>> Also, I did have multiple cases of my IBackingMap workers dying
>>>>>> (because of RuntimeExceptions) but successfully restarting afterwards.
>>>>>> (I throw RuntimeExceptions in the IBackingMap implementation as my
>>>>>> strategy in rare SQL database deadlock situations, to force a worker
>>>>>> restart and to fail+retry the batch.)
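
For illustration, a sketch of that strategy; the IBackingMap interface is
real, while CounterDao and its methods are hypothetical stand-ins for the
SQL access code:

    import storm.trident.state.map.IBackingMap;
    import java.sql.SQLException;
    import java.util.List;

    // Hypothetical placeholder for the actual SQL data-access helper.
    interface CounterDao {
        List<Long> select(List<List<Object>> keys) throws SQLException;
        void upsert(List<List<Object>> keys, List<Long> vals) throws SQLException;
    }

    public class SqlBackingMap implements IBackingMap<Long> {
        private final CounterDao dao;

        public SqlBackingMap(CounterDao dao) { this.dao = dao; }

        @Override
        public List<Long> multiGet(List<List<Object>> keys) {
            try {
                return dao.select(keys);
            } catch (SQLException e) {
                // Rethrow unchecked: the worker dies and restarts, and
                // Trident fails and retries the batch.
                throw new RuntimeException(e);
            }
        }

        @Override
        public void multiPut(List<List<Object>> keys, List<Long> vals) {
            try {
                dao.upsert(keys, vals);
            } catch (SQLException e) {
                throw new RuntimeException(e); // e.g. chosen as deadlock victim
            }
        }
    }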
>>>>>>
>>>>>> From the logs, one such IBackingMap worker death (and subsequent
>>>>>> restart) resulted in the Kafka spout re-emitting the pending tuple:
>>>>>>
>>>>>>     2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting batch, attempt 29698959:736
>>>>>>
>>>>>> This is of course the normal behavior of a transactional topology,
>>>>>> but this is the first time I've encountered a case of a batch retrying
>>>>>> indefinitely. This is especially suspicious since the topology has been
>>>>>> running fine for 20 days straight, re-emitting batches and restarting
>>>>>> IBackingMap workers quite a number of times.
>>>>>>
>>>>>> I can see in my IBackingMap backing SQL database that the batch with
>>>>>> the exact txid value 29698959 has been committed -- but I suspect that
>>>>>> could come from the other BackingMap, since there are two BackingMap
>>>>>> instances running (parallelismHint 2).
>>>>>>
>>>>>> However, I have no idea why the batch is being retried indefinitely
>>>>>> now nor why it hasn't been successfully acked by Trident.
>>>>>>
>>>>>> Any suggestions on the area (topology component) to focus my research
>>>>>> on?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <
>>>>>> danijel@schiavuzzi.com> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I'm having problems with my transactional Trident topology. It has
>>>>>> been running fine for about 20 days, and suddenly is stuck processing a
>>>>>> single batch, with no tuples being emitted nor tuples being persisted
>>>>>> by the TridentState (IBackingMap).
>>>>>>
>>>>>> It's a simple topology which consumes messages off a Kafka queue. The
>>>>>> spout is an instance of storm-kafka-0.8-plus TransactionalTridentKafkaSpout
>>>>>> and I use the trident-mssql transactional TridentState implementation
>>>>>> to persistentAggregate() data into a SQL database.
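
For readers following along, roughly the shape of such a topology. The ZK
address, topic, stream and field names are placeholders, sqlStateFactory
stands in for the trident-mssql state factory, and the real topology
presumably parses tuples before grouping:

    import backtype.storm.generated.StormTopology;
    import backtype.storm.tuple.Fields;
    import storm.kafka.ZkHosts;
    import storm.kafka.trident.TransactionalTridentKafkaSpout;
    import storm.kafka.trident.TridentKafkaConfig;
    import storm.trident.TridentTopology;
    import storm.trident.operation.builtin.Count;
    import storm.trident.state.StateFactory;

    public class CounterTopology {
        public static StormTopology build(StateFactory sqlStateFactory) {
            // Transactional Kafka spout from storm-kafka-0.8-plus.
            TridentKafkaConfig spoutConf =
                    new TridentKafkaConfig(new ZkHosts("zk1:2181"), "events");
            TransactionalTridentKafkaSpout spout =
                    new TransactionalTridentKafkaSpout(spoutConf);

            TridentTopology topology = new TridentTopology();
            topology.newStream("events", spout)
                    .groupBy(new Fields("key"))
                    .persistentAggregate(sqlStateFactory, new Count(),
                            new Fields("count"))
                    .parallelismHint(2); // two IBackingMap instances, as above
            return topology.build();
        }
    }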
>>>>>>
>>>>>> In Zookeeper I can see Storm is re-trying a batch, i.e.
>>>>>>
>>>>>>      "/transactional/<myTopologyName>/coordinator/currattempts" is
>>>>>>      "{"29698959":6487}"
>>>>>>
>>>>>> ... and the attempt count keeps increasing. It seems the batch with
>>>>>> txid 29698959 is stuck -- it isn't being acked by Trident and I have
>>>>>> no idea why, especially since the topology has been running
>>>>>> successfully for the last 20 days.
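
One way to watch that counter from the shell is the stock ZooKeeper CLI
(the server address is a placeholder); if the attempt count for the same
txid keeps climbing between reads, the batch is failing deterministically:

    bin/zkCli.sh -server localhost:2181
    get /transactional/<myTopologyName>/coordinator/currattempts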
>>>>>>
>>>>>> I did rebalance the topology on one occasion, after which it
>>>>>> continued running normally. Other than that, no other modifications
>>>>>> were done. Storm is at version 0.9.0.1.
>>>>>>
>>>>>> Any hints on how to debug the stuck topology? Any other useful info
>>>>>> I might provide?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> --
>>>>>> Danijel Schiavuzzi
>>>>>>
>>>>>> E: danijel@schiavuzzi.com
>>>>>> W: www.schiavuzzi.com
>>>>>> T: +385989035562
>>>>>> Skype: danijel.schiavuzzi
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Danijel Schiavuzzi
>>>>>>
>>>>>> E: danijel@schiavuzzi.com
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Danijel Schiavuzzi
>>>>>
>>>>> E: danijel@schiavuzzi.com
>>>>> W: www.schiavuzzi.com
>>>>> T: +385989035562
>>>>> Skype: danijels7
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Danijel Schiavuzzi
>>>
>>> E: danijel@schiavuzzi.com
>>> W: www.schiavuzzi.com
>>> T: +385989035562
>>> Skype: danijels7
>>>
>>
>>
>
