storm-user mailing list archives

From Jason Jackson <jasonj...@gmail.com>
Subject Re: Topology is stuck
Date Thu, 10 Apr 2014 19:27:37 GMT
Storm is using Curator.

Given an ensemble of size 5, what if a single client connected to
ZK member 0 performs a write quorum operation which succeeds after updating
ZK members 0, 1, and 2, then loses its connection to ZK member 0, and
reconnects (using Curator) to ZK member 3, and performs a read of the same
znode it recently updated? ZK member 3 has not yet been synchronized with
that latest write. Unless the client requests a forced sync, in my
understanding it will read an older value. Is this not true? When does the
synchronization happen?
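
For illustration, here is a minimal sketch of the forced-sync-then-read I have in mind, using Curator (the connect string and znode path below are placeholders; ZooKeeper's sync completes asynchronously, so a latch waits for its callback before reading):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    import java.util.concurrent.CountDownLatch;

    public class ForcedSyncRead {
        public static void main(String[] args) throws Exception {
            // Placeholder connect string and znode path -- substitute your own.
            String connect = "zk0:2181,zk1:2181,zk2:2181,zk3:2181,zk4:2181";
            String path = "/some/znode";

            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    connect, new ExponentialBackoffRetry(1000, 3));
            client.start();

            // sync() asks the server we are connected to (possibly a follower we
            // just reconnected to) to catch up with the leader; it completes
            // asynchronously, so block on a latch before reading.
            CountDownLatch synced = new CountDownLatch(1);
            client.sync().inBackground((c, event) -> synced.countDown()).forPath(path);
            synced.await();

            byte[] data = client.getData().forPath(path);
            System.out.println(new String(data, "UTF-8"));
        }
    }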


On Thu, Apr 10, 2014 at 3:05 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Jason,
>
> A single client is guaranteed to have strict consistency in its own reads
> and writes.  If the write has occurred according to the client, then all
> subsequent reads by that client will show it.  This applies even if the
> client does a write, is disconnected from ZK, reconnects automatically and
> then reads.  The only place that weaker consistency applies is if
> A successfully writes to ZK and then sends an out-of-band message to B, and
> B looks at ZK upon receiving the notification from A.  In that case, B may
> not see the update from A right away.
>
> The most common source of hangs such as you describe that I know about is
> cases where change notification handlers are not coded correctly and they
> lose the watcher on a status variable by forgetting to reset it when
> handling a change.  This can happen due to exceptions.  The best way to
> avoid such problems is to use a higher-level library such as Curator.  I
> forget if Storm already uses Curator, but I seem to remember not.
>
>
>
>
> On Thu, Apr 10, 2014 at 1:39 AM, Jason Jackson <jasonjckn@gmail.com> wrote:
>
>> My idea for the bug was that Trident expects to read from ZooKeeper what
>> was recently written to ZooKeeper for the same znode, and due to sequential
>> consistency it sometimes reads an older value even though it just wrote a
>> newer value. I could be way off the mark though; it's just an idea to
>> explore more.
>>
>>
>> On Thu, Apr 10, 2014 at 1:36 AM, Jason Jackson <jasonjckn@gmail.com> wrote:
>>
>>> Hi Ted, thanks for clearing up the language; I intended to express
>>> sequential consistency then.
>>>
>>> Yes, you could do a forced sync too; that would be another good test.
>>>
>>> Taylor, the bug that I witnessed only occurs after you leave a trident
>>> topology running for at least a day. One day it'll just stop making
>>> progress and re-attempt the same batch forever.  Unfortunately I can't send
>>> the particular trident code to you, but I don't think there's anything
>>> unique about it. I suspect any trident topology could reproduce the bug if
>>> run for a week. Other correlated factors may be that the trident topology
>>> occasionally fails batches, and that the zookeeper cluster is under
>>> significant load from other applications beyond trident. I don't have many
>>> details, unfortunately.
>>>
>>> -Jason
>>>
>>>
>>>
>>>
>>> On Wed, Apr 9, 2014 at 3:03 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>
>>>>
>>>> In what sense do you mean when you say that reads in ZK are eventually
>>>> consistent?
>>>>
>>>> You may get a slightly old value, but you are guaranteed to see a
>>>> consistent history.  That is, if a znode goes through values (each with a version
>>>> number) v_1 ... v_n, then once you see v_i, you will never later see v_j where j < i.
>>>>
>>>> You can also guarantee that you don't even see delayed values by using
>>>> sync.
>>>>
>>>> Normally when people say "eventually consistent" they mean that two
>>>> participants can see inconsistent histories under partition.  That isn't
>>>> possible in ZK.  As I understand it, ZK would be better described as
>>>> providing sequential consistency since all observers will see all updates
>>>> in the same order.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Apr 9, 2014 at 2:50 PM, Jason Jackson <jasonjckn@gmail.com> wrote:
>>>>
>>>>> I have one theory that because reads in zookeeper are eventually
>>>>> consistent, this is a necessary condition for the bug to manifest. So one
>>>>> way to test this hypothesis is to run a zookeeper ensemble with 1 node,
>>>>> or a zookeeper ensemble configured for 5 nodes but with 2 of them taken offline,
>>>>> so that every write operation only succeeds if every member of the ensemble
>>>>> sees the write. This should produce strongly consistent reads. If you run
>>>>> this test, let me know what the results are. (Clearly this isn't a good
>>>>> production system, though, as you're making a tradeoff for lower availability
>>>>> in favor of greater consistency, but the results could help narrow down the
>>>>> bug)
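
(For anyone trying that single-node test: a standalone ZooKeeper server is simply one with no server.N entries in its config. A minimal zoo.cfg sketch, with placeholder paths and port, might look like this:

    # standalone ZooKeeper: no server.N entries, so there is only one member
    tickTime=2000
    dataDir=/var/lib/zookeeper
    clientPort=2181
)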
>>>>>
>>>>>
>>>>> On Wed, Apr 9, 2014 at 2:43 PM, Jason Jackson <jasonjckn@gmail.com> wrote:
>>>>>
>>>>>> Yah, it's probably a bug in trident. It would be amazing if someone
>>>>>> figured out the fix for this. I spent about 6 hours looking into it, but
>>>>>> couldn't figure out why it was occurring.
>>>>>>
>>>>>> Beyond fixing this, one thing you could do to buy yourself time is
>>>>>> disable batch retries in trident. There's no option for this in the API,
>>>>>> but it's like a 1 or 2 line change to the code. Obviously you lose
>>>>>> exactly-once semantics, but at least you would have a system that never
>>>>>> falls behind real-time.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 9, 2014 at 1:10 AM, Danijel Schiavuzzi <danijel@schiavuzzi.com> wrote:
>>>>>>
>>>>>>> Thanks Jason. However, I don't think that was the case in my stuck
>>>>>>> topology, otherwise I'd have seen exceptions (thrown by my Trident
>>>>>>> functions) in the worker logs.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Apr 9, 2014 at 3:02 AM, Jason Jackson <jasonjckn@gmail.com> wrote:
>>>>>>>
>>>>>>>> An example of "corrupted input" that causes a batch to fail would be
>>>>>>>> if you expected a certain schema for the data that you read off
>>>>>>>> Kafka, or some queue, and for whatever reason the data didn't conform to
>>>>>>>> that schema, and the function that you implement and pass to
>>>>>>>> stream.each throws an exception when this unexpected situation occurs. This
>>>>>>>> would cause the batch to be retried, but it's deterministically failing, so
>>>>>>>> the batch will be retried forever.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Apr 7, 2014 at 10:37 AM, Danijel Schiavuzzi <danijel@schiavuzzi.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Jason,
>>>>>>>>>
>>>>>>>>> Could you be more specific -- what do you mean by "corrupted
>>>>>>>>> input"? Do you mean that there's a bug in Trident itself that causes the
>>>>>>>>> tuples in a batch to somehow become corrupted?
>>>>>>>>>
>>>>>>>>> Thanks a lot!
>>>>>>>>>
>>>>>>>>> Danijel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Monday, April 7, 2014, Jason Jackson <jasonjckn@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> This could happen if you have corrupted input that always causes
>>>>>>>>>> a batch to fail and be retried.
>>>>>>>>>>
>>>>>>>>>> I have seen this behaviour before and I didn't see corrupted
>>>>>>>>>> input. It might be a bug in trident, I'm not sure. If you figure it out
>>>>>>>>>> please update this thread and/or submit a patch.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 31, 2014 at 7:39 AM, Danijel Schiavuzzi <danijel@schiavuzzi.com> wrote:
>>>>>>>>>>
>>>>>>>>>> To (partially) answer my own question -- I still have no idea on
>>>>>>>>>> the cause of the stuck topology, but re-submitting the topology helps --
>>>>>>>>>> after re-submitting my topology is now running normally.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <danijel@schiavuzzi.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Also, I did have multiple cases of my IBackingMap workers dying
>>>>>>>>>> (because of RuntimeExceptions) but successfully restarting afterwards (I
>>>>>>>>>> throw RuntimeExceptions in the BackingMap implementation as my strategy in
>>>>>>>>>> rare SQL database deadlock situations, to force a worker restart and to
>>>>>>>>>> fail+retry the batch).
>>>>>>>>>>
>>>>>>>>>> From the logs, one such IBackingMap worker death (and subsequent
>>>>>>>>>> restart) resulted in the Kafka spout re-emitting the pending tuple:
>>>>>>>>>>
>>>>>>>>>>     2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting batch, attempt 29698959:736
>>>>>>>>>>
>>>>>>>>>> This is of course the normal behavior of a transactional
>>>>>>>>>> topology, but this is the first time I've encountered a case of a batch
>>>>>>>>>> retrying indefinitely. This is especially suspicious since the topology has
>>>>>>>>>> been running fine for 20 days straight, re-emitting batches and restarting
>>>>>>>>>> IBackingMap workers quite a number of times.
>>>>>>>>>>
>>>>>>>>>> I can see in my IBackingMap backing SQL database that the batch
>>>>>>>>>> with the exact txid value 29698959 has been committed -- but I suspect that
>>>>>>>>>> could come from another BackingMap, since there are two BackingMap
>>>>>>>>>> instances running (parallelismHint 2).
>>>>>>>>>>
>>>>>>>>>> However, I have no idea why the batch is being retried
>>>>>>>>>> indefinitely now nor why it hasn't been successfully
acked by Trident.
>>>>>>>>>>
>>>>>>>>>> Any suggestions on the area (topology component) to focus my
>>>>>>>>>> research on?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <danijel@schiavuzzi.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I'm having problems with my transactional Trident topology. It
>>>>>>>>>> has been running fine for about 20 days, and suddenly is stuck processing a
>>>>>>>>>> single batch, with no tuples being emitted nor tuples being persisted by
>>>>>>>>>> the TridentState (IBackingMap).
>>>>>>>>>>
>>>>>>>>>> It's a simple topology which consumes messages off a Kafka queue.
>>>>>>>>>> The spout is an instance of storm-kafka-0.8-plus
>>>>>>>>>> TransactionalTridentKafkaSpout and I use the trident-mssql transactional
>>>>>>>>>> TridentState implementation to persistentAggregate() data into a SQL
>>>>>>>>>> database.
>>>>>>>>>>
>>>>>>>>>> In Zookeeper I can see Storm is re-trying a batch, i.e.
>>>>>>>>>>
>>>>>>>>>>      "/transactional/<myTopologyName>/coordinator/currattempts"
>>>>>>>>>> is "{"29698959":6487}"
>>>>>>>>>>
>>>>>>>>>> ... and the attempt count keeps increasing. It seems the batch
>>>>>>>>>> with txid 29698959 is stuck, as the attempt count in Zookeeper keeps
>>>>>>>>>> increasing -- seems like the batch isn't being acked by Trident and I have
>>>>>>>>>> no idea why, especially since the topology has been running successfully
>>>>>>>>>> the last 20 days.
>>>>>>>>>>
>>>>>>>>>> I did rebalance the topology on one occasion, after which it
>>>>>>>>>> continued running normally. Other than that, no other modifications were
>>>>>>>>>> done. Storm is at version 0.9.0.1.
>>>>>>>>>>
>>>>>>>>>> Any hints on how to debug the stuck topology? Any other useful
>>>>>>>>>> info I might provide?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Danijel Schiavuzzi
>>>>>>>>>>
>>>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>>>> W: www.schiavuzzi.com
>>>>>>>>>> T: +385989035562
>>>>>>>>>> Skype: danijel.schiavuzzi
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Danijel Schiavuzzi
>>>>>>>>>>
>>>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Danijel Schiavuzzi
>>>>>>>>>
>>>>>>>>> E: danijel@schiavuzzi.com
>>>>>>>>> W: www.schiavuzzi.com
>>>>>>>>> T: +385989035562
>>>>>>>>> Skype: danijels7
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Danijel Schiavuzzi
>>>>>>>
>>>>>>> E: danijel@schiavuzzi.com
>>>>>>> W: www.schiavuzzi.com
>>>>>>> T: +385989035562
>>>>>>> Skype: danijels7
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
