ignite-dev mailing list archives

From Павлухин Иван <vololo...@gmail.com>
Subject Re: Exceptions thrown in IndexingSpi and "fail fast" principle
Date Thu, 10 Jan 2019 18:12:56 GMT

It's a really hard question. Generally, the fail-fast principle is a
good idea: ignoring every error can lead to real damage.

But in fact that is a very general statement. Pure fail-fast does not
sound like a good solution for a server application that should run
stably for many hours, and nobody is free from bugs. So some kind of
"defensive programming" is needed here. I believe it should be solved
architecturally: some code could be executed in blocks where all
exceptions are ignored, while other code (the critical core) should
fail the node on an unexpected error to prevent damage from
unpredictable consequences.
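
To make the idea concrete, here is a minimal sketch of such a split. All
names in it (ErrorZones, FailureHandler, runTolerant, runCritical) are
hypothetical and are not Ignite API; the failure handler only stands in
for whatever would actually stop the node.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of two execution zones with different error policies. All names are hypothetical. */
class ErrorZones {
    /** Hypothetical hook that would stop the node; a real one might halt the JVM. */
    interface FailureHandler {
        void onCriticalFailure(Throwable error);
    }

    private final FailureHandler failureHandler;
    private final List<String> warnings = new ArrayList<>();

    ErrorZones(FailureHandler failureHandler) {
        this.failureHandler = failureHandler;
    }

    /** Tolerant zone: the error is recorded as a warning and the node keeps running. */
    boolean runTolerant(String taskName, Runnable task) {
        try {
            task.run();
            return true;
        }
        catch (Exception e) {
            warnings.add(taskName + " failed: " + e);
            return false;
        }
    }

    /** Critical zone: any unexpected error is escalated to the failure handler. */
    void runCritical(Runnable task) {
        try {
            task.run();
        }
        catch (Throwable t) {
            failureHandler.onCriticalFailure(t);
        }
    }

    /** Warnings collected from the tolerant zone. */
    List<String> warnings() {
        return warnings;
    }
}
```

The point of the split is that the policy is chosen by where the code
runs, not by each call site deciding on its own what to swallow.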

Regarding IndexingSpi: I cannot immediately say which exceptions can be
ignored and when. I prefer to keep things simple, and I suppose that
all of this behavior could be specified in the API. It is really hard
to say why it would be safe to ignore an arbitrary exception thrown by
an external SPI implementation. An implementation itself can be written
defensively and catch any Throwable, while also taking care to keep
its state consistent after an error.
In short, option B generally does not look so bad to me. Can't we
require a defensive style for IndexingSpi implementations?
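
As a rough illustration of what I mean by a defensive style, below is a
sketch against a simplified, hypothetical interface (SimpleIndexingSpi
is not the real IndexingSpi, and SpiException only plays the role of
IgniteSpiException). Everything that can fail runs before any state is
touched, and any Throwable is converted into the one declared exception
type.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical, simplified stand-in for an indexing SPI; not the real Ignite interface. */
interface SimpleIndexingSpi {
    void store(Object key, Object val);
}

/** The only exception type callers must expect (plays the role of IgniteSpiException). */
class SpiException extends RuntimeException {
    SpiException(String msg, Throwable cause) {
        super(msg, cause);
    }
}

class DefensiveIndexingSpi implements SimpleIndexingSpi {
    /** User-supplied code that derives the indexed term; it may throw anything. */
    private final Function<Object, String> extractor;
    private final Map<Object, String> termByKey = new HashMap<>();
    private final Map<String, Set<Object>> keysByTerm = new HashMap<>();

    DefensiveIndexingSpi(Function<Object, String> extractor) {
        this.extractor = extractor;
    }

    @Override public synchronized void store(Object key, Object val) {
        final String term;
        try {
            // Step 1: run everything that can fail before mutating any state.
            term = Objects.requireNonNull(extractor.apply(val), "term");
        }
        catch (Throwable t) {
            // Nothing was modified yet, so both maps are still consistent.
            throw new SpiException("Failed to index key " + key, t);
        }
        // Step 2: apply the change to both maps.
        String oldTerm = termByKey.put(key, term);
        if (oldTerm != null) {
            Set<Object> oldKeys = keysByTerm.get(oldTerm);
            oldKeys.remove(key);
            if (oldKeys.isEmpty())
                keysByTerm.remove(oldTerm);
        }
        keysByTerm.computeIfAbsent(term, k -> new HashSet<>()).add(key);
    }

    /** Returns a copy of the keys currently indexed under the given term. */
    synchronized Set<Object> lookup(String term) {
        return new HashSet<>(keysByTerm.getOrDefault(term, Collections.emptySet()));
    }
}
```

With this shape a failed store leaves the previous index entry for the
key in place, so the caller sees a declared exception and a consistent
index rather than a half-applied update.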

Tue, 4 Dec 2018 at 20:49, Vyacheslav Daradur <daradurvs@gmail.com>:
> Regarding PME:
> I think some kind of exceptions may be caught in
> 'GridDhtPartitionsExchangeFuture' to be able to notify the
> coordinator.
> For example, if an exception occurred during cache or index (or
> something else) creation on some node, the node can notify coordinator
> and coordinator should be able to make a decision on how to handle the
> situation. AFAIK a message 'DynamicCacheChangeFailureMessage' will be
> sent in case of some kind of error.
> It would be great if most actions managed by PME were handled
> in the same manner. For doing this we can specify some sort of
> exception types and rules on how to handle them, so the coordinator
> will be able to finish a hung PME across the cluster.
> On Tue, Dec 4, 2018 at 8:01 PM Ilya Kasnacheev
> <ilya.kasnacheev@gmail.com> wrote:
> >
> > Hello!
> >
> > Currently, Apache Ignite is mostly written in "fail fast" fashion.
> >
> > All of Apache Ignite codebase is assumed to have no bugs. When an
> > unexpected exception happens, it will be printed to log and will usually *leave
> > current operation hanging forever*. This is useful to developers since they
> > can spot problems right away and to users since they can avoid further data
> > loss, but it often leads to *data loss of current data*.
> >
> > The most notorious case of such errors is hanging PMEs. When PME handling
> > on a single node in the cluster results in an exception, the whole cluster
> > will hang up forever until this node is killed. I guess you can observe
> > data loss after exceptions during rebalance. You can also have various
> > operations hanging once remote node throws an unexpected exception.
> >
> > Most recently I'm trying to fix IgniteErrorOnRebalanceTest in IGNITE-9842.
> > It tests exception thrown from IndexingSpi, it never worked and it leads to
> > silent data loss. When baseline is introduced, it will now lead to hanging
> > PME.
> >
> > Should we fight this problem? IndexingSpi implementation is external to us,
> > we should either:
> > A. Catch any exception that it would throw. If it was thrown during
> > rebalance, ignore it with warning to avoid data loss.
> > B. Assume that it never throws exceptions (or that it will only throw
> > IgniteSpiException). As soon as any exception is thrown, the behavior of
> > cluster is undefined. This is current behavior.
> >
> > *Are we ready to make a leap from B to A?*
> >
> > Note that currently, if an exception is thrown from IndexingSpi, an
> > operation will fail in mid-flight, meaning that part of data could be
> > updated and the rest was not. It is possible that entry was added to cache
> > but not indexed by SQL, for example. We will need to be able to roll back
> > any operation when error occurs.
> >
> > With regard to PME:
> > - When there is an exception during exchange, we should be able to switch
> > back to previous topology version on all nodes.
> > - That means the node which was trying to join is kicked from topology (and
> > not the one that had this exception thrown). Or the cache is not created,
> > or not destroyed, etc.
> > - No data loss since all existing nodes happily continue to work on the old
> > topology version and new node did not have any data yet.
> >
> > Basically, every remote operation should be guarded with a fallback where a
> > message is sent to caller when operation did not succeed. This will mean
> > that no operation ever hangs.
> >
> > WDYT?
> >
> > Regards,
> > --
> > Ilya Kasnacheev
> --
> Best Regards, Vyacheslav D.

Best regards,
Ivan Pavlukhin
