ignite-dev mailing list archives

From Дмитрий Сорокин <sbt.sorokin....@gmail.com>
Subject Re: Ignite Enhancement Proposal #7 (Internal problems detection)
Date Wed, 29 Nov 2017 11:09:01 GMT
Vladimir,

At the moment the policy looks like this:

/**
 * Policy that defines how a node will process failures. Note that the default
 * failure processing policy is defined by the
 * {@link IgniteConfiguration#DFLT_FLR_PLC} property.
 */
public enum FailureProcessingPolicy {
    /** Restart JVM. */
    RESTART_JVM,

    /** Stop. */
    STOP,

    /** Noop. */
    NOOP;
}

Can you give an example of different event (failure) types that need
different reactions?
We expect that every failure in which an Ignite system worker (or another
critical component) breaks needs the same policy on a given node.
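
To make the question concrete, here is a minimal sketch of how a single node-wide policy could coexist with optional per-failure overrides, as suggested earlier in the thread. It is an illustration only, not the code awaiting review: FailureType and FailurePolicyResolver are hypothetical names that do not exist in Ignite, and the failure categories are simply the examples mentioned in this discussion.

```java
import java.util.EnumMap;
import java.util.Map;

/** Mirrors the enum quoted above. */
enum FailureProcessingPolicy { RESTART_JVM, STOP, NOOP }

/** Hypothetical failure categories, taken from the examples in this thread. */
enum FailureType { SYSTEM_WORKER_CRASH, IGNITE_OOM, PERSISTENCE_ERROR }

/** Hypothetical helper: one node-wide default policy plus optional per-type overrides. */
final class FailurePolicyResolver {
    /** Policy applied when no override is registered for a failure type. */
    private final FailureProcessingPolicy dflt;

    /** Per-failure-type overrides; empty when a single policy is enough. */
    private final Map<FailureType, FailureProcessingPolicy> overrides =
        new EnumMap<>(FailureType.class);

    FailurePolicyResolver(FailureProcessingPolicy dflt) {
        this.dflt = dflt;
    }

    /** Registers an override and returns this resolver to allow chaining. */
    FailurePolicyResolver override(FailureType type, FailureProcessingPolicy plc) {
        overrides.put(type, plc);
        return this;
    }

    /** Returns the override for the given type, or the node-wide default. */
    FailureProcessingPolicy resolve(FailureType type) {
        return overrides.getOrDefault(type, dflt);
    }
}
```

With no overrides registered this behaves exactly like the single-policy approach. Overrides would matter only if, for example, IgniteOutOfMemoryException should leave the node running while other failures stop it.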


2017-11-29 13:56 GMT+03:00 Vladimir Ozerov <vozerov@gridgain.com>:

> Dmitry,
>
> Thank you, but what does FailureProcessingPolicy look like? It is not clear how
> I can configure different reactions to different event types.
>
> On Wed, Nov 29, 2017 at 1:47 PM, Дмитрий Сорокин <
> sbt.sorokin.dvl@gmail.com>
> wrote:
>
> > Vladimir,
> >
> > These policies (a single policy, in fact) can be configured in IgniteConfiguration
> > by calling the setFailureProcessingPolicy(FailureProcessingPolicy flrPlc)
> > method.
> >
> > 2017-11-29 10:35 GMT+03:00 Vladimir Ozerov <vozerov@gridgain.com>:
> >
> > > Denis,
> > >
> > > Yes, but can we look at the proposed API before we dig into implementation?
> > >
> > > On Tue, Nov 28, 2017 at 9:43 PM, Denis Magda <dmagda@apache.org> wrote:
> > >
> > > > I think the failure processing policy should be configured via
> > > > IgniteConfiguration in a way similar to the segmentation policies.
> > > >
> > > > —
> > > > Denis
> > > >
> > > > > On Nov 27, 2017, at 11:28 PM, Vladimir Ozerov <vozerov@gridgain.com> wrote:
> > > > >
> > > > > Dmitry,
> > > > >
> > > > > How will these policies be configured? Do you have any API in mind?
> > > > >
> > > > > On Thu, Nov 23, 2017 at 6:26 PM, Denis Magda <dmagda@apache.org> wrote:
> > > > >
> > > > > >> No objections here. Additional policies like EXEC might be added later
> > > > > >> depending on user needs.
> > > > >>
> > > > >> —
> > > > >> Denis
> > > > >>
> > > > > >>> On Nov 23, 2017, at 2:26 AM, Дмитрий Сорокин <sbt.sorokin.dvl@gmail.com> wrote:
> > > > >>>
> > > > >>> Denis,
> > > > > >>> I propose starting with the first three policies (they are already
> > > > > >>> implemented and just await some code cleanup, commit & review).
> > > > > >>> As for the fourth policy (EXEC), I think it is an additional property
> > > > > >>> (some script path) rather than a policy.
> > > > >>>
> > > > >>> 2017-11-23 0:43 GMT+03:00 Denis Magda <dmagda@apache.org>:
> > > > >>>
> > > > > >>>> Just provide FailureProcessingPolicy with possible reactions:
> > > > > >>>> - NOOP - exceptions will be reported and metrics will be triggered, but
> > > > > >>>> the affected Ignite process won’t be touched.
> > > > > >>>> - HALT (or STOP or KILL) - all the actions of NOOP + Ignite
> > > > > >>>> process termination.
> > > > > >>>> - RESTART - NOOP actions + process restart.
> > > > > >>>> - EXEC - execute a custom script provided by the user.
> > > > > >>>>
> > > > > >>>> If needed, the policy can be set per known failure, such as OOM or
> > > > > >>>> persistence errors, so that the user can act accordingly based on
> > > > > >>>> the context.
> > > > >>>> —
> > > > >>>> Denis
> > > > >>>>
> > > > > >>>>> On Nov 21, 2017, at 11:43 PM, Vladimir Ozerov <vozerov@gridgain.com> wrote:
> > > > >>>>>
> > > > > >>>>> In the first iteration I would focus only on reporting facilities, to
> > > > > >>>>> let the administrator spot dangerous situations. In the second phase,
> > > > > >>>>> when all reporting and metrics are ready, we can think about some
> > > > > >>>>> automatic actions.
> > > > >>>>>
> > > > > >>>>> On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov <mcherkasov@gridgain.com> wrote:
> > > > >>>>>
> > > > >>>>>> Hi Anton,
> > > > >>>>>>
> > > > > >>>>>> I don't think that we should shut down the node in case of
> > > > > >>>>>> IgniteOOMException. If one node has no space, then the others probably
> > > > > >>>>>> don't have it either, so rebalancing will cause IgniteOOM on all other
> > > > > >>>>>> nodes and will kill the whole cluster. I think that for some
> > > > > >>>>>> configurations the cluster should survive and allow the user to clean
> > > > > >>>>>> the cache and/or add more nodes.
> > > > >>>>>>
> > > > >>>>>> Thanks,
> > > > >>>>>> Mikhail.
> > > > >>>>>>
> > > > >>>>>> 20 нояб. 2017 г. 6:53 ПП пользователь
"Anton Vinogradov" <
> > > > >>>>>> avinogradov@gridgain.com> написал:
> > > > >>>>>>
> > > > >>>>>>> Igniters,
> > > > >>>>>>>
> > > > > >>>>>>> Internal problems may, and unfortunately do, cause unexpected cluster
> > > > > >>>>>>> behavior.
> > > > > >>>>>>> We should define the behavior for the case when any internal problem
> > > > > >>>>>>> happens.
> > > > > >>>>>>>
> > > > > >>>>>>> Well-known internal problems can be split into:
> > > > > >>>>>>> 1) OOM or any other reason that causes a node crash
> > > > >>>>>>>
> > > > > >>>>>>> 2) Situations requiring graceful node shutdown with custom notification
> > > > >>>>>>> - IgniteOutOfMemoryException
> > > > >>>>>>> - Persistence errors
> > > > >>>>>>> - ExchangeWorker exits with error
> > > > >>>>>>>
> > > > > >>>>>>> 3) Performance issues that should be covered by metrics
> > > > >>>>>>> - GC STW duration
> > > > >>>>>>> - Timed out tasks and jobs
> > > > >>>>>>> - TX deadlock
> > > > >>>>>>> - Hanged Tx (waits for some service)
> > > > >>>>>>> - Java Deadlocks
> > > > >>>>>>>
> > > > > >>>>>>> I created a special issue [1] to make sure all these metrics will be
> > > > > >>>>>>> presented in WebConsole or VisorConsole (which is preferred?)
> > > > > >>>>>>> 4) Situations requiring an external monitoring implementation
> > > > > >>>>>>> - GC STW duration exceeds the maximum possible length (the node should
> > > > > >>>>>>> be stopped before the STW finishes)
> > > > >>>>>>>
> > > > > >>>>>>> All these problems were reported by different people at different
> > > > > >>>>>>> times, so we should reanalyze each of them and, possibly, find better
> > > > > >>>>>>> ways to solve them than those described in the issues.
> > > > > >>>>>>>
> > > > > >>>>>>> P.S. IEP-7 [2] already contains 9 issues, feel free to mention
> > > > > >>>>>>> something else :)
> > > > >>>>>>>
> > > > >>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
> > > > >>>>>>> [2]
> > > > >>>>>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-
> > > > >>>>>>> 7%3A+Ignite+internal+problems+detection
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>
> > > > >>>>
> > > > >>
> > > > >>
> > > >
> > > >
> > >
> >
>
