ignite-dev mailing list archives

From Denis Magda <dma...@apache.org>
Subject Re: Partition recovery issue on partition loss.
Date Thu, 22 Mar 2018 18:25:33 GMT
Igniters,

Is anybody working on this bug? There is a high chance we can add a fix to
2.5 if the community agrees to release it earlier.

--
Denis

On Thu, Mar 15, 2018 at 11:04 AM, Denis Magda <dmagda@apache.org> wrote:

> I dared to set fix version to 2.5 and increased the severity. It's
> important to fix the race since we've just released the partition loss
> functionality in 2.4 and it's already broken.
>
> Andrey, please keep us posted. If you are not going to fix it, we will need
> to find another contributor.
>
> --
> Denis
>
> On Thu, Mar 15, 2018 at 7:29 AM, Dmitry Pavlov <dpavlov.spb@gmail.com>
> wrote:
>
>> Hi Andrew Mashenkov,
>>
>> would you like to pick up this issue?
>>
>> Sincerely,
>> Dmitriy Pavlov
>>
>> On Thu, Mar 15, 2018 at 6:23, Dmitriy Setrakyan <dsetrakyan@apache.org> wrote:
>>
>> > Completely agree, we must fix this. I like the proposed design. We should
>> > also specify that the resetLostPartitions() method should return true or
>> > false.
>> >
>> > Val, do you mind updating the ticket with the new design?
>> > https://issues.apache.org/jira/browse/IGNITE-7832
>> >
>> > D.
>> >
>> > On Tue, Mar 13, 2018 at 5:31 PM, Valentin Kulichenko <
>> > valentin.kulichenko@gmail.com> wrote:
>> >
>> > > This indeed looks like a bigger issue. Basically, there is no clear way
>> > > (or no way at all) to synchronize the code that listens to the partition
>> > > loss event and the code that calls the resetLostPartitions() method.
>> > > Example scenario:
>> > >
>> > > 1. Cache is configured with 3rd party persistence.
>> > > 2. One or more nodes fail, causing loss of several partitions in memory.
>> > > 3. Ignite blocks access to those partitions according to the partition
>> > > loss policy and fires an event.
>> > > 4. Application listens to the event and starts reloading the data from
>> > > the store.
>> > > 5. When reloading is complete, the application calls
>> > > resetLostPartitions() to restore access.
>> > > 6. Nodes fail again, causing another partition loss; a new event is fired.
>> > >
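>> > > There is a race between steps 5 and 6. If the 2nd failure happens BEFORE
>> > > resetLostPartitions() is called, the call also resets the newly lost
>> > > partitions, which have not been reloaded yet, and we end up with
>> > > inconsistent data.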
>> > >
>> > > I believe the only way to fix this is to add the corresponding topology
>> > > version to the partition loss event, and also add it as a parameter for
>> > > resetLostPartitions().
>> > > This way, if resetLostPartitions() is invoked with a version that is no
>> > > longer the latest, the invocation will be ignored.
>> > >
>> > > The only problem with this approach is that the topology version itself
>> > > is currently not a part of the public API. It needs to be properly
>> > > exposed there first.
>> > >
>> > > -Val
>> > >
>> > > On Mon, Mar 12, 2018 at 1:07 PM, Denis Magda <dmagda@apache.org>
>> wrote:
>> > >
>> > > > Just in case, here is where you can find the current documentation:
>> > > >
>> > > > https://apacheignite.readme.io/docs/cache-modes#partition-loss-policies
>> > > >
>> > > > Let us know what needs to be updated once the issues you reported are
>> > > > addressed.
>> > > >
>> > > > --
>> > > > Denis
>> > > >
>> > > > On Mon, Mar 12, 2018 at 3:33 AM, Andrey Mashenkov <
>> > > > andrey.mashenkov@gmail.com> wrote:
>> > > >
>> > > > > Hi Igniters,
>> > > > >
>> > > > > I've found that we have no documentation on how a user can recover a
>> > > > > cache from the cacheStore in case of partition loss.
>> > > > > Ignite provides some instruments (methods and events) that should
>> > > > > help the user solve this problem,
>> > > > > but it looks like these instruments have an architectural flaw.
>> > > > >
>> > > > > The first one is a usability issue. Ignite provides a partition loss
>> > > > > event so that the user can handle this, but Ignite fires one event
>> > > > > per partition.
>> > > > > Why can't we have a single event with a list of lost partitions?
>> > > > >
>> > > > > The second one is a bug. The Ignite.resetLostPartitions() method
>> > > > > doesn't care which topology version the recovered partitions
>> > > > > belonged to.
>> > > > > There is a race when the user calls this method after a node has
>> > > > > failed but right before Ignite fires the event.
>> > > > > So it is possible that the state of just-lost partitions will be
>> > > > > reset unexpectedly.
>> > > > >
>> > > > >
>> > > > > I've created a ticket for this [1] and think we should rethink the
>> > > > > architecture of the partition recovery mechanics and improve the
>> > > > > documentation.
>> > > > > Any thoughts?
>> > > > >
>> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-7832
>> > > > >
>> > > > >
>> > > > > --
>> > > > > Best regards,
>> > > > > Andrey V. Mashenkov
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
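
The design proposed in the quoted thread (a topology version carried by the partition loss event and passed back to resetLostPartitions(), with a true/false result) can be illustrated with a toy model. This is only a sketch under stated assumptions: it does not use the Ignite API, and the class LostPartitionTracker, the method onNodeFailure(), and the plain long used as a topology version are all hypothetical names invented for the illustration.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Toy model, NOT Ignite code: tracks lost partitions together with the
// topology version at which the most recent loss happened.
class LostPartitionTracker {
    private long topVer = 0;                        // hypothetical public topology version
    private final Set<Integer> lost = new HashSet<>();

    // Simulates a node failure: bumps the version, records the newly lost
    // partitions, and returns the version a loss event would carry.
    synchronized long onNodeFailure(Set<Integer> parts) {
        topVer++;
        lost.addAll(parts);
        return topVer;
    }

    // Resets lost partitions only if the caller has seen the latest loss.
    // Returns true if the reset was applied, false if it was ignored as stale.
    synchronized boolean resetLostPartitions(long seenTopVer) {
        if (seenTopVer != topVer)
            return false;
        lost.clear();
        return true;
    }

    // Returns a read-only snapshot of the currently lost partitions.
    synchronized Set<Integer> lostPartitions() {
        return Collections.unmodifiableSet(new HashSet<>(lost));
    }
}

class VersionedResetSketch {
    public static void main(String[] args) {
        LostPartitionTracker tracker = new LostPartitionTracker();

        // Steps 2-4: first failure; the application starts reloading partitions 1 and 2.
        long v1 = tracker.onNodeFailure(Set.of(1, 2));

        // Step 6 happens before step 5: a second failure loses partition 7.
        long v2 = tracker.onNodeFailure(Set.of(7));

        // Step 5 with the stale version is ignored, so partition 7 stays blocked.
        System.out.println(tracker.resetLostPartitions(v1)); // false
        System.out.println(tracker.lostPartitions().contains(7)); // true

        // After handling the second event, the reset with the current version applies.
        System.out.println(tracker.resetLostPartitions(v2)); // true
    }
}
```

In this model a false result tells the application that another loss happened in the meantime, so it should handle the newer event and retry the reset with that event's version.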
