ignite-dev mailing list archives

From Павлухин Иван <vololo...@gmail.com>
Subject Re: Non-blocking PME Phase One (Node fail)
Date Thu, 19 Sep 2019 09:09:33 GMT
Anton, folks,

Out of curiosity: do we have any measurements of PME execution in an
environment similar to a real-world one? To my mind, some kind of
"profile" showing the durations of the different PME phases, with an
indication of where we are "blocked", would be ideal.
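Such a profile could start from something as simple as a per-phase timer. Below is a minimal, purely illustrative sketch (the `PmeProfile` class and the phase names are hypothetical, not actual Ignite APIs) of how per-phase durations could be collected and printed:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PmeProfile {
    // Accumulated time per phase, in insertion order.
    private final Map<String, Long> phaseNanos = new LinkedHashMap<>();
    private String current;
    private long startedAt;

    /** Closes the previous phase (if any) and starts timing a new one. */
    public void enter(String phase) {
        leave();
        current = phase;
        startedAt = System.nanoTime();
    }

    /** Stops timing the current phase and records its duration. */
    public void leave() {
        if (current != null)
            phaseNanos.merge(current, System.nanoTime() - startedAt, Long::sum);
        current = null;
    }

    public Map<String, Long> phases() {
        return phaseNanos;
    }

    public static void main(String[] args) throws Exception {
        PmeProfile p = new PmeProfile();
        // Phase names are made up for illustration only.
        p.enter("wait-for-ongoing-ops"); Thread.sleep(5);
        p.enter("partition-recovery");   Thread.sleep(2);
        p.enter("affinity-recalc");      Thread.sleep(1);
        p.leave();
        p.phases().forEach((k, v) -> System.out.printf("%s: %.1f ms%n", k, v / 1e6));
    }
}
```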

On Wed, Sep 18, 2019 at 13:27, Anton Vinogradov <av@apache.org> wrote:
>
> Alexey,
>
> >> Can you describe the complete protocol changes first
> The current goal is to find a better way.
> I had at least 5 scenarios discarded because corner cases were found
> (thanks to Alexey Scherbakov, Aleksei Plekhanov and Nikita Amelchev).
> That's why I explained what we are able to improve and why I think it works.
>
> >> we need to remove this property, not add new logic that relies on it.
> Agree.
>
> >> How are you going to synchronize this?
> Thanks for providing this case; it seems to discard the #1 + #2.2
> combination, while #2.1 is still possible with some optimizations.
>
> The "zombie eating transactions" case can, I think, be solved theoretically.
> As I said before, we may perform a "Local switch" in case the affinity was
> not changed (except for the failed node's miss); other cases require a
> regular PME.
> In this case, the new-primary is an ex-backup, and we expect that the
> old-primary will try to update the new-primary as a backup.
> The new-primary will handle operations as a backup until it is notified that
> it is now the primary.
> Operations from the ex-primary will be discarded or remapped once the
> new-primary is notified that it became the primary.
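The "act as a backup until notified" rule above can be sketched as a tiny state machine (illustrative Java, not Ignite code; class and method names are invented for the example):

```java
public class NewPrimarySwitch {
    enum Role { BACKUP, PRIMARY }

    private Role role = Role.BACKUP;

    /** Called when the switch protocol notifies this node it is now primary. */
    public void promote() {
        role = Role.PRIMARY;
    }

    /**
     * Returns true if a backup-update from the ex-primary is applied.
     * While still a backup, the update is applied as usual; after promotion,
     * the ex-primary's updates must be discarded (or remapped).
     */
    public boolean onBackupUpdateFromExPrimary() {
        return role == Role.BACKUP;
    }

    public static void main(String[] args) {
        NewPrimarySwitch node = new NewPrimarySwitch();
        System.out.println(node.onBackupUpdateFromExPrimary()); // applied: true
        node.promote();
        System.out.println(node.onBackupUpdateFromExPrimary()); // discarded: false
    }
}
```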
>
> Discarding is a big improvement,
> remapping is a huge improvement,
> but there is no 100% guarantee that the ex-primary will try to update the
> new-primary as a backup.
>
> There are a lot of corner cases here.
> So it seems a minimized sync is the better solution.
>
> Finally, according to your and Alexey Scherbakov's comments, the better
> approach is just to improve PME to wait for less, at least for now.
> It seems we have to wait for (or cancel; I vote for cancelling, any
> objections?) the operations related to the failed primaries and wait for the
> recovery to finish (which is fast).
> In case the affinity was changed and a backup-primary switch (not related to
> the failed primaries) is required between the owners, or even a rebalance is
> required, we should just ignore this and perform only a "Recovery PME".
> A regular PME should happen later (if necessary), and it can even be delayed
> (if necessary).
>
> Sounds good?
>
> On Wed, Sep 18, 2019 at 11:46 AM Alexey Goncharuk <
> alexey.goncharuk@gmail.com> wrote:
>
> > Anton,
> >
> > I looked through the presentation and the ticket but did not find any new
> > protocol description that you are going to implement. Can you describe the
> > complete protocol changes first (to save time both for you and for the
> > reviewers later)?
> >
> > Some questions that I have in mind:
> >  * It looks like that for "Local Switch" optimization you assume that node
> > failure happens immediately for the whole cluster. This is not true - some
> > nodes may "see" a node A failure, while others still consider it alive.
> > Moreover, node A may not know yet that it is about to be failed and process
> > requests correctly. This may happen, for example, due to a long GC pause on
> > node A. In this case, node A resumes its execution and proceeds to work as
> > a primary (it has not received a segmentation event yet), node B also did
> > not receive the A FAILED event yet. Node C received the event, ran the
> > "Local switch" and assumed a primary role, node D also received the A
> > FAILED event and switched to the new primary. Transactions from nodes B and
> > D will be processed simultaneously on different primaries. How are you
> > going to synchronize this?
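One common way to fence the race described above (a sketch of a general technique, not Ignite's actual mechanism; names are invented for illustration) is to stamp every update with the topology version the sender mapped it on and reject updates mapped on an older version, forcing the sender to remap:

```java
public class TopologyFence {
    private long localTopVer;

    public TopologyFence(long topVer) {
        this.localTopVer = topVer;
    }

    /** Seeing a node-failed event bumps the local topology version. */
    public void onNodeFailed() {
        localTopVer++;
    }

    /** Accept only updates mapped on at least the receiver's topology version. */
    public boolean accept(long senderTopVer) {
        return senderTopVer >= localTopVer;
    }

    public static void main(String[] args) {
        TopologyFence nodeC = new TopologyFence(5);
        nodeC.onNodeFailed(); // C saw "A FAILED" and is now at version 6
        System.out.println(nodeC.accept(5)); // node B still on v5: rejected -> false
        System.out.println(nodeC.accept(6)); // node D remapped to v6: accepted -> true
    }
}
```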
> >  * IGNITE_MAX_COMPLETED_TX_COUNT is fragile and we need to remove this
> > property, not add new logic that relies on it. There is no way a user can
> > calculate this property or adjust it in runtime. If a user decreases this
> > property below a safe value, we will get inconsistent update counters and
> > cluster desync. Besides, IGNITE_MAX_COMPLETED_TX_COUNT is quite a large
> > value and will push HWM forward very quickly, much faster than during
> > regular updates (you will have to increment it for each partition).
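The scale of the HWM jump described above can be illustrated with back-of-the-envelope arithmetic (assuming the default IGNITE_MAX_COMPLETED_TX_COUNT value of 262144; the class below is illustrative, not Ignite code):

```java
public class HwmJump {
    // Default of the IGNITE_MAX_COMPLETED_TX_COUNT system property
    // (assumed here; it is configurable).
    static final long MAX_COMPLETED_TX_COUNT = 262_144;

    /** HWM of one partition after bumping it on every switch. */
    static long hwmAfterSwitches(long hwm, int switches) {
        return hwm + switches * MAX_COMPLETED_TX_COUNT;
    }

    public static void main(String[] args) {
        long hwm = 1_000; // counter grown by regular updates
        System.out.println(hwmAfterSwitches(hwm, 1)); // 263144
        System.out.println(hwmAfterSwitches(hwm, 3)); // 787432
    }
}
```

A single switch moves the counter further than hundreds of thousands of regular updates would, which is the "pushes HWM forward very quickly" concern.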
> >
> > On Wed, Sep 18, 2019 at 10:53, Anton Vinogradov <av@apache.org> wrote:
> >
> > > Igniters,
> > >
> > > Recently we had a discussion devoted to non-blocking PME.
> > > We agreed that the most important case is blocking on node failure, and
> > > it can be split into:
> > >
> > > 1) Affected partitions' operation latency will be increased by the node
> > > failure detection duration.
> > > So, some operations may be frozen for 10+ seconds on real clusters, just
> > > waiting for a failed primary's response.
> > > In other words, some operations will be blocked even before the blocking
> > > PME has started.
> > >
> > > The good news here is that a bigger cluster decreases the percentage of
> > > blocked operations.
> > >
> > > The bad news is that these operations may block non-affected operations
> > > in:
> > > - customer code (single-thread/striped pool usage)
> > > - multikey operations (tx1 locked A and waits for failed B;
> > > non-affected tx2 waits for A)
> > > - striped pools inside AI (when some task waits for tx.op() in a sync
> > > way and the striped thread is busy)
> > > - etc ...
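The multikey case in the list above is a transitive-blocking chain, which can be pictured as a wait-for graph (illustrative Java, not Ignite code): tx1 holds the lock on A and waits for the failed node B, and tx2, which only needs A, ends up blocked by the failure as well.

```java
import java.util.HashMap;
import java.util.Map;

public class WaitForGraph {
    // Each waiter blocks on exactly one blocker in this simplified model.
    private final Map<String, String> waitsFor = new HashMap<>();

    public void addEdge(String waiter, String blocker) {
        waitsFor.put(waiter, blocker);
    }

    /** Follows the wait-for chain to its final (root) blocker. */
    public String rootBlocker(String tx) {
        String cur = tx;
        while (waitsFor.containsKey(cur))
            cur = waitsFor.get(cur);
        return cur;
    }

    public static void main(String[] args) {
        WaitForGraph g = new WaitForGraph();
        g.addEdge("tx1", "failed-node-B"); // tx1 locked A, waits for B's response
        g.addEdge("tx2", "tx1");           // tx2 waits for the lock on A
        System.out.println(g.rootBlocker("tx2")); // failed-node-B
    }
}
```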
> > >
> > > It seems that, thanks to StopNodeFailureHandler (if configured), we
> > > already always send the node-left event before the node stops, to
> > > minimize the waiting period.
> > > So only the cases that cause a hang without a stop are the problem now.
> > >
> > > Anyway, some additional research is required here, and it would be nice
> > > if someone were willing to help.
> > >
> > > 2) Some optimizations may speed up the node-left case (eliminating the
> > > blocking of upcoming operations).
> > > A full list can be found in the presentation [1].
> > > The list contains 8 optimizations, but I propose to implement some at
> > > phase one and the rest at phase two.
> > > Assuming that a real production deployment has Baseline enabled, we are
> > > able to gain a speed-up by implementing the following:
> > >
> > > #1 Switch on the node_fail/node_left event locally instead of starting a
> > > real PME (Local switch).
> > > Since BLT is enabled, we are always able to switch to the new-affinity
> > > primaries (no need to preload partitions).
> > > In case we're not able to switch to the new-affinity primaries (all
> > > owners missed or BLT disabled), we'll just start a regular PME.
> > > The new-primary calculation can be performed locally or by the
> > > coordinator (e.g. attached to the node_fail message).
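The local new-primary calculation mentioned above works because baseline affinity is deterministic: every node can independently pick the same new primary. A minimal sketch (illustrative, not Ignite's actual affinity code) picking the first alive owner in the partition's affinity order:

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class LocalSwitch {
    /**
     * affinityOrder: partition owners in affinity order, primary first.
     * Returns the first alive owner, i.e. the new primary, if any.
     */
    static Optional<String> newPrimary(List<String> affinityOrder, Set<String> alive) {
        return affinityOrder.stream().filter(alive::contains).findFirst();
    }

    public static void main(String[] args) {
        List<String> owners = List.of("A", "C", "D"); // A was primary; C, D backups
        Set<String> alive = Set.of("B", "C", "D");    // A just failed
        System.out.println(newPrimary(owners, alive).orElse("none")); // C
    }
}
```

If no owner is alive (the "all owners missed" case in the text), the result is empty and a regular PME would be needed instead.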
> > >
> > > #2 We should not wait for the completion of any already started
> > > operations (since they are not related to the failed primary's
> > > partitions).
> > > The only problem is the recovery, which may cause update-counter
> > > duplications in case of an unsynced HWM.
> > >
> > > #2.1 We may wait only for recovery completion (Micro-blocking switch).
> > > Just block (all, at this phase) upcoming operations during the recovery
> > > by incrementing the topology version.
> > > In other words, it will be some kind of PME with waiting, but it will
> > > wait for the recovery (fast) instead of for finishing the current
> > > operations (long).
> > >
> > > #2.2 Recovery, theoretically, can be async.
> > > We have to solve the unsynced-HWM issue (to avoid concurrent usage of
> > > the same counters) to make it happen.
> > > We may just increment the HWM by IGNITE_MAX_COMPLETED_TX_COUNT on the
> > > new-primary and continue the recovery in an async way.
> > > Currently, IGNITE_MAX_COMPLETED_TX_COUNT specifies the number of
> > > committed transactions we expect between "the first backup committed
> > > tx1" and "the last backup committed the same tx1".
> > > I propose to use it to specify the number of prepared transactions we
> > > expect between "the first backup prepared tx1" and "the last backup
> > > prepared the same tx1".
> > > Both cases look pretty similar.
> > > In this case, we are able to make the switch fully non-blocking, with
> > > async recovery.
> > > Thoughts?
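The counter-reservation idea in #2.2 can be sketched as follows (illustrative Java, not Ignite code; the 262144 default of IGNITE_MAX_COMPLETED_TX_COUNT is an assumption): the new-primary bumps its HWM by the property value, so counters assigned to new operations cannot collide with counters still being assigned by the in-flight recovery of prepared transactions.

```java
public class CounterReservation {
    // Assumed default of IGNITE_MAX_COMPLETED_TX_COUNT.
    static final long MAX_COMPLETED_TX_COUNT = 262_144;

    private long hwm;

    CounterReservation(long lastKnownHwm) {
        // Reserve [lastKnownHwm, lastKnownHwm + MAX_COMPLETED_TX_COUNT) for
        // async recovery; new operations get counters above this window.
        hwm = lastKnownHwm + MAX_COMPLETED_TX_COUNT;
    }

    /** Assigns the next update counter for a new operation. */
    long nextCounter() {
        return ++hwm;
    }

    public static void main(String[] args) {
        CounterReservation c = new CounterReservation(10_000);
        System.out.println(c.nextCounter()); // 272145
    }
}
```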
> > >
> > > So, I'm going to implement both improvements in the "Lightweight version
> > > of partitions map exchange" issue [2] if no one minds.
> > >
> > > [1]
> > >
> > >
> > https://docs.google.com/presentation/d/1Ay7OZk_iiJwBCcA8KFOlw6CRmKPXkkyxCXy_JNg4b0Q/edit?usp=sharing
> > > [2] https://issues.apache.org/jira/browse/IGNITE-9913
> > >
> >



-- 
Best regards,
Ivan Pavlukhin
