ignite-dev mailing list archives

From Ilya Kasnacheev <ilya.kasnach...@gmail.com>
Subject Exceptions thrown in IndexingSpi and "fail fast" principle
Date Tue, 04 Dec 2018 17:01:30 GMT

Currently, Apache Ignite is mostly written in "fail fast" fashion.

The entire Apache Ignite codebase is assumed to have no bugs. When an unexpected exception happens, it is printed to the log and will usually *leave the current operation hanging forever*. This is useful to developers, who can spot problems right away, and to users, who avoid further data loss, but it often leads to *loss of the current data*.

The most notorious case of such errors is a hanging PME. When PME handling on a single node in the cluster results in an exception, the whole cluster will hang forever until that node is killed. I guess you can also observe data loss after exceptions during rebalance, and various operations can hang once a remote node throws an unexpected exception.

Most recently I have been trying to fix IgniteErrorOnRebalanceTest in IGNITE-9842. It tests an exception thrown from IndexingSpi; the test never worked, and the scenario leads to silent data loss. Once baseline is introduced, it will instead lead to hanging.

Should we fight this problem? Since the IndexingSpi implementation is external to us, we should do one of the following:
A. Catch any exception that it throws. If it was thrown during rebalance, ignore it with a warning to avoid data loss.
B. Assume that it never throws exceptions (or that it only throws IgniteSpiException). As soon as any other exception is thrown, the behavior of the cluster is undefined. This is the current behavior.
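To make option A concrete, here is a minimal sketch of guarding an external SPI call: any RuntimeException thrown during rebalance is logged as a warning and swallowed, while outside rebalance it still fails the operation explicitly. The names (SpiGuard, guardedSpiCall) are illustrative, not real Ignite APIs.

```java
import java.util.logging.Logger;

class SpiGuard {
    private static final Logger log = Logger.getLogger("SpiGuard");

    /**
     * Runs an external SPI call. During rebalance, a failure is downgraded
     * to a warning (option A) and reported via the return value, so the
     * rebalance itself can continue without losing data. Outside rebalance,
     * the exception still propagates (fail fast).
     */
    static boolean guardedSpiCall(Runnable spiCall, boolean duringRebalance) {
        try {
            spiCall.run();
            return true;
        }
        catch (RuntimeException e) {
            if (duringRebalance) {
                log.warning("IndexingSpi failed during rebalance, ignoring: " + e);
                return false; // option A: ignore with warning
            }
            throw e; // option B territory: fail the operation explicitly
        }
    }
}
```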

*Are we ready to make a leap from B to A?*

Note that currently, if an exception is thrown from IndexingSpi, an operation will fail mid-flight, meaning that part of the data may have been updated while the rest was not. For example, an entry may have been added to the cache but not indexed by SQL. We will need to be able to roll back any operation when an error occurs.
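The rollback requirement can be sketched as follows: the cache write is undone whenever the indexing step throws, so an entry is never left cached-but-unindexed. A plain Map stands in for the cache and a Runnable for the IndexingSpi store step; all names here are hypothetical.

```java
import java.util.Map;

class RollbackSketch {
    /**
     * Writes a value to the cache and then runs the indexing step.
     * If indexing throws, the cache write is rolled back to its previous
     * state, keeping cache and SQL index consistent with each other.
     */
    static boolean putIndexed(Map<String, String> cache, String key, String val,
                              Runnable indexStep) {
        String prev = cache.put(key, val);
        try {
            indexStep.run(); // e.g. the IndexingSpi store call
            return true;
        }
        catch (RuntimeException e) {
            // Roll back: restore the previous value, or remove a fresh entry.
            if (prev == null)
                cache.remove(key);
            else
                cache.put(key, prev);
            return false;
        }
    }
}
```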

With regard to PME:
- When there is an exception during exchange, we should be able to switch back to the previous topology version on all nodes.
- That means the node that was trying to join is kicked from the topology (and not the one where the exception was thrown). Or the cache is not created, or not destroyed, etc.
- There is no data loss, since all existing nodes happily continue to work on the old topology version and the new node did not have any data yet.

Basically, every remote operation should be guarded with a fallback: a message is sent to the caller whenever the operation does not succeed. This ensures that no operation ever hangs.
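The "always reply" guard above can be sketched like this: a remote handler wraps the operation so the caller always receives either a success or a failure reply, and therefore never waits forever. A CompletableFuture stands in for the reply channel; the class and message formats are illustrative, not Ignite internals.

```java
import java.util.concurrent.CompletableFuture;

class GuardedHandler {
    /**
     * Executes a remote operation and always completes the reply future:
     * with "OK" on success, or with a failure message when the operation
     * throws. The caller blocks on the reply but can never hang forever
     * waiting for a response that was lost to an exception.
     */
    static CompletableFuture<String> handle(Runnable op) {
        CompletableFuture<String> reply = new CompletableFuture<>();
        try {
            op.run();
            reply.complete("OK");
        }
        catch (RuntimeException e) {
            reply.complete("FAILED: " + e.getMessage()); // fallback message
        }
        return reply;
    }
}
```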


Ilya Kasnacheev
