ignite-dev mailing list archives

From Denis Magda <dma...@apache.org>
Subject Re: Partition recovery issue on partition loss.
Date Thu, 15 Mar 2018 18:04:18 GMT
I took the liberty of setting the fix version to 2.5 and raising the severity.
It's important to fix the race since we've just released the partition loss
functionality in 2.4 and it's already broken.

Andrey, please keep us posted. If you can't fix it, we will need to find
another contributor.

--
Denis

On Thu, Mar 15, 2018 at 7:29 AM, Dmitry Pavlov <dpavlov.spb@gmail.com>
wrote:

> Hi Andrew Mashenkov,
>
> would you like to pick up this issue?
>
> Sincerely,
> Dmitriy Pavlov
>
> On Thu, Mar 15, 2018 at 6:23, Dmitriy Setrakyan <dsetrakyan@apache.org> wrote:
>
> > Completely agree, we must fix this. I like the proposed design. We should
> > also specify that the resetLostPartitions() method should return true or
> > false.
> >
> > Val, do you mind updating the ticket with the new design?
> > https://issues.apache.org/jira/browse/IGNITE-7832
> >
> > D.
> >
> > On Tue, Mar 13, 2018 at 5:31 PM, Valentin Kulichenko <
> > valentin.kulichenko@gmail.com> wrote:
> >
> > > This indeed looks like a bigger issue. Basically, there is no clear way
> > > (or no way at all) to synchronize the code that listens to the partition
> > > loss event with the code that calls the resetLostPartitions() method.
> > > Example scenario:
> > >
> > > 1. Cache is configured with 3rd party persistence.
> > > 2. One or more nodes fail, causing loss of several partitions in memory.
> > > 3. Ignite blocks access to those partitions according to the partition
> > > loss policy and fires an event.
> > > 4. The application listens to the event and starts reloading the data
> > > from the store.
> > > 5. When reloading is complete, the application calls
> > > resetLostPartitions() to restore access.
> > > 6. Nodes fail again, causing another partition loss; a new event is fired.
> > >
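> > > There is a race between steps 5 and 6. If the 2nd failure happens BEFORE
> > > resetLostPartitions() is called, we end up with inconsistent data, because
> > > the reset also clears the lost state of the newly lost partitions, which
> > > have not been reloaded yet.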
> > >
> > > I believe the only way to fix this is to add the corresponding topology
> > > version to the partition loss event, and also add it as a parameter for
> > > resetLostPartitions().
> > > This way, if resetLostPartitions() is invoked with a version that is no
> > > longer the latest, the invocation will be ignored.
> > >
> > > The only problem with this approach is that the topology version itself
> > > is currently not part of the public API. It needs to be properly exposed
> > > there first.
> > >
> > > -Val
> > >
> > > On Mon, Mar 12, 2018 at 1:07 PM, Denis Magda <dmagda@apache.org> wrote:
> > >
> > > > Just in case, here is where you can find the current documentation:
> > > >
> > > > https://apacheignite.readme.io/docs/cache-modes#partition-loss-policies
> > > >
> > > > Let us know what needs to be updated once the issues you reported are
> > > > addressed.
> > > >
> > > > --
> > > > Denis
> > > >
> > > > On Mon, Mar 12, 2018 at 3:33 AM, Andrey Mashenkov <
> > > > andrey.mashenkov@gmail.com> wrote:
> > > >
> > > > > Hi Igniters,
> > > > >
> > > > > I've found that we have no documentation on how a user can recover a
> > > > > cache from the cacheStore in case of partition loss.
> > > > > Ignite provides some instruments (methods and events) that should help
> > > > > the user solve this problem,
> > > > > but it looks like these instruments have architectural flaws.
> > > > >
> > > > > The first one is a usability issue. Ignite provides a partition loss
> > > > > event so that the user can handle this, but Ignite fires an event per
> > > > > partition.
> > > > > Why can't we have a single event with a list of lost partitions?
> > > > >
> > > > > The second one is a bug. The Ignite.resetLostPartitions() method
> > > > > doesn't care which topology version the recovered partitions belonged
> > > > > to.
> > > > > There is a race when the user calls this method after a node has
> > > > > failed but right before Ignite fires the event.
> > > > > So it is possible that the state of just-lost partitions will be reset
> > > > > unexpectedly.
> > > > >
> > > > >
> > > > > I've created a ticket for this [1] and think we should rethink the
> > > > > architecture of the partition recovery mechanics and improve the
> > > > > documentation.
> > > > > Any thoughts?
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-7832
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Andrey V. Mashenkov
> > > > >
> > > >
> > >
> >
>
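
The version-checked reset proposed in the thread above could look roughly like the sketch below. It is only an illustration under stated assumptions: the class LostPartitionTracker, the method onPartitionsLost(), and the plain counter standing in for the topology version are all hypothetical and are not part of Ignite's API. The boolean return value follows the suggestion that resetLostPartitions() should report whether the reset was applied.

```java
import java.util.Set;
import java.util.TreeSet;

/**
 * Hypothetical, simplified model of version-checked partition recovery.
 * None of these names are Ignite's API; they only illustrate the idea.
 */
final class LostPartitionTracker {
    /** Counter standing in for the topology version; bumped on every simulated failure. */
    private long topologyVersion;

    /** Partitions currently marked as lost. */
    private final Set<Integer> lost = new TreeSet<>();

    /** Simulates a node failure and returns the version the loss event would carry. */
    synchronized long onPartitionsLost(int... partitions) {
        topologyVersion++;
        for (int p : partitions)
            lost.add(p);
        return topologyVersion;
    }

    /** Clears the lost state only if the caller acted on the latest version. */
    synchronized boolean resetLostPartitions(long eventTopologyVersion) {
        if (eventTopologyVersion != topologyVersion)
            return false; // Stale request: a newer loss happened, so it is ignored.
        lost.clear();
        return true;
    }

    /** Returns a copy of the currently lost partitions. */
    synchronized Set<Integer> lostPartitions() {
        return new TreeSet<>(lost);
    }

    public static void main(String[] args) {
        LostPartitionTracker tracker = new LostPartitionTracker();
        long v1 = tracker.onPartitionsLost(1, 2); // first failure
        long v2 = tracker.onPartitionsLost(3);    // second failure during reload
        System.out.println(tracker.resetLostPartitions(v1)); // false
        System.out.println(tracker.lostPartitions());        // [1, 2, 3]
        System.out.println(tracker.resetLostPartitions(v2)); // true
    }
}
```

In this model, an application that receives false would reload the newly lost partitions as well and retry with the version from the latest event, so partition 3 is never reported as recovered before its data is back.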
