ignite-dev mailing list archives

From Vladimir Ozerov <voze...@gridgain.com>
Subject Re: Facility to detect long STW pauses and other system response degradations
Date Tue, 21 Nov 2017 10:55:51 GMT

Honestly, I do not understand why we need a separate process to monitor a
node's state. Could you list the advantages of this approach compared to
in-proc monitoring threads?
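
For reference, a minimal sketch of what an in-proc monitoring thread could look like, along the lines of the Java-thread test quoted further down. The class name, the 10 ms sleep interval and the 100 ms threshold are illustrative assumptions, not Ignite code:

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;

/** Hypothetical in-proc pause detector; not part of Ignite. */
class PauseDetector implements Runnable {
    private final long intervalMs;
    private final long thresholdMs;
    private final LongConsumer onPause;

    PauseDetector(long intervalMs, long thresholdMs, LongConsumer onPause) {
        this.intervalMs = intervalMs;
        this.thresholdMs = thresholdMs;
        this.onPause = onPause;
    }

    /** Delay in ms beyond the requested sleep interval; never negative. */
    static long pauseMs(long startNanos, long endNanos, long intervalMs) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0, elapsedMs - intervalMs);
    }

    @Override public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMs);
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            // If the thread woke up much later than requested, the whole JVM
            // (or the host) was probably stalled for roughly that long.
            long pause = pauseMs(start, System.nanoTime(), intervalMs);
            if (pause > thresholdMs)
                onPause.accept(pause);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(new PauseDetector(10, 100, p ->
            System.out.println("Possible too long STW pause: " + p + " milliseconds.")),
            "pause-detector");
        t.setDaemon(true);
        t.start();
        Thread.sleep(1000); // let the detector run for one second
        t.interrupt();
    }
}
```

Such a thread can only report a pause after it has ended, and it cannot distinguish a GC pause from any other stall, which is the limitation raised elsewhere in this thread.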

On Tue, Nov 21, 2017 at 1:16 PM, Дмитрий Сорокин <sbt.sorokin.dvl@gmail.com>
wrote:

> Don't forget that high CPU utilization can occur for reasons other
> than GC STW, and GC log parsing will not help us in that case.
>
>
> On Tue, 21 Nov 2017 at 13:06, Anton Vinogradov [via Apache Ignite
> Developers] <ml+s2346864n24497h16@n4.nabble.com> wrote:
>
> > Denis,
> >
> > > 1. Totally for a separate native process that will handle the monitoring
> > > of an Ignite process. The watchdog process can simply start a JVM tool like
> > > jstat and parse its GC logs:
> > > https://dzone.com/articles/how-monitor-java-garbage
> > Different GCs, and even the same GC on different OS/JVM combinations,
> > produce different log formats, so parsing them is not easy. But since
> > http://gceasy.io can do it, it looks possible, somehow :).
> > Do you know of any libraries or solutions that can do this in real time?
> >
> > > 2. As for the STW handling, I would make a possible reaction more
> > > generic. Let’s define a policy (enumeration) that will define how to deal
> > > with an unstable node. The events might be as follows - kill a node,
> > > restart a node, trigger a custom script using Runtime.exec or other
> > > methods.
> > Yes, it should be similar to the segmentation policy plus custom script
> > execution.
> >
> >
> > On Tue, Nov 21, 2017 at 2:10 AM, Denis Magda <[hidden email]> wrote:
> >
> > > My 2 cents.
> > >
> > > 1. Totally for a separate native process that will handle the monitoring
> > > of an Ignite process. The watchdog process can simply start a JVM tool like
> > > jstat and parse its GC logs:
> > > https://dzone.com/articles/how-monitor-java-garbage
> > >
> > > 2. As for the STW handling, I would make a possible reaction more generic.
> > > Let’s define a policy (enumeration) that will define how to deal with an
> > > unstable node. The events might be as follows - kill a node, restart a
> > > node, trigger a custom script using Runtime.exec or other methods.
> > >
> > > What do you think? Specifically on point 2.
> > >
> > > —
> > > Denis
> > >
> > > > On Nov 20, 2017, at 6:47 AM, Anton Vinogradov <[hidden email]> wrote:
> > > >
> > > > Yakov,
> > > >
> > > > The issue is https://issues.apache.org/jira/browse/IGNITE-6171
> > > >
> > > > We split the issue into:
> > > > #1 STW duration metrics
> > > > #2 External monitoring that allows stopping a node during STW
> > > >
> > > >> Testing GC pause with java thread is
> > > >> a bit strange and can give info only after GC pause finishes.
> > > >
> > > > That's ok since it's #1
> > > >
> > > > On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin <[hidden email]> wrote:
> > > >
> > > > >> I have tested the solution with a Java thread, and the GC logs contained
> > > > >> the same pause values as those detected by the Java thread.
> > > >>
> > > >>
> > > >> My log (contains pauses > 100ms):
> > > >> [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too long
> > STW
> > > >> pause: 507 milliseconds.
> > > >> [2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too long
> > STW
> > > >> pause: 5595 milliseconds.
> > > >> [2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too long
> > STW
> > > >> pause: 3262 milliseconds.
> > > >> [2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too long
> > STW
> > > >> pause: 1737 milliseconds.
> > > >>
> > > >> GC log:
> > > > >> gridgain@dell-5580-92zc8h2:~$ cat ./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
> > > > >> 2017-11-20T17:33:27.608+0300: 0,116: Total time for which application threads were stopped: 0,0000845 seconds, Stopping threads took: 0,0000246 seconds
> > > > >> 2017-11-20T17:33:27.667+0300: 0,175: Total time for which application threads were stopped: 0,0001072 seconds, Stopping threads took: 0,0000252 seconds
> > > > >> 2017-11-20T17:33:28.822+0300: 1,330: Total time for which application threads were stopped: 0,5001082 seconds, Stopping threads took: 0,0000178 seconds    // GOT!
> > > > >> 2017-11-20T17:33:34.521+0300: 7,030: Total time for which application threads were stopped: 5,5856603 seconds, Stopping threads took: 0,0000229 seconds    // GOT!
> > > > >> 2017-11-20T17:33:37.896+0300: 10,405: Total time for which application threads were stopped: 3,2595700 seconds, Stopping threads took: 0,0000223 seconds    // GOT!
> > > > >> 2017-11-20T17:33:39.714+0300: 12,222: Total time for which application threads were stopped: 1,7337123 seconds, Stopping threads took: 0,0000121 seconds    // GOT!
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
> > > >>
> > >
> > >
> > If you reply to this email, your message will be added to the discussion
> > below:
> >
> > http://apache-ignite-developers.2346864.n4.nabble.
> com/Facility-to-detect-long-STW-pauses-and-other-system-
> response-degradations-tp24391p24497.html
> > To unsubscribe from Facility to detect long STW pauses and other system
> > response degradations, click here
> > <http://apache-ignite-developers.2346864.n4.nabble.
> com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=24391&code=
> c2J0LnNvcm9raW4uZHZsQGdtYWlsLmNvbXwyNDM5MXwtMjA0OTY3OTkxOQ==>
> > .
> > NAML
> > <http://apache-ignite-developers.2346864.n4.nabble.
> com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_
> html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.
> BasicNamespace-nabble.view.web.template.NabbleNamespace-
> nabble.view.web.template.NodeNamespace&breadcrumbs=
> notify_subscribers%21nabble%3Aemail.naml-instant_emails%
> 21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
> >
>
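
To make the two quoted proposals concrete, here is a rough sketch of the logic an external watchdog might apply: extract the stopped time from a "Total time for which application threads were stopped" line, as in the GC log quoted above, and pick a reaction from a policy enumeration. The class, enum and method names are hypothetical, the reactions are only described rather than executed, and the parser handles only this one line format, not GC logs in general:

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical external watchdog logic; names are illustrative, not Ignite API. */
class StwWatchdogSketch {
    /** Reactions listed in the thread: kill, restart, or run a custom script. */
    enum UnstableNodePolicy { KILL, RESTART, CUSTOM_SCRIPT }

    // Matches the "threads were stopped" line from the quoted GC log. The
    // sample uses ',' as the decimal separator, so both ',' and '.' are accepted.
    private static final Pattern STOPPED = Pattern.compile(
        "Total time for which application threads were stopped: (\\d+[.,]\\d+) seconds");

    /** Stopped time in seconds, or empty if the line has a different format. */
    static OptionalDouble stoppedSeconds(String line) {
        Matcher m = STOPPED.matcher(line);
        if (!m.find())
            return OptionalDouble.empty();
        return OptionalDouble.of(Double.parseDouble(m.group(1).replace(',', '.')));
    }

    /** Describes what the watchdog would do; real process handling is left out. */
    static String reaction(UnstableNodePolicy policy) {
        switch (policy) {
            case KILL:
                return "kill the node process";
            case RESTART:
                return "kill the node process and start it again";
            case CUSTOM_SCRIPT:
                return "run a user-supplied script";
            default:
                throw new IllegalArgumentException("Unknown policy: " + policy);
        }
    }

    public static void main(String[] args) {
        String line = "2017-11-20T17:33:34.521+0300: 7,030: Total time for which application "
            + "threads were stopped: 5,5856603 seconds, Stopping threads took: 0,0000229 seconds";
        double thresholdSec = 0.1; // same 100 ms cut-off as the log in the thread
        UnstableNodePolicy policy = UnstableNodePolicy.RESTART; // assumed configuration value

        stoppedSeconds(line).ifPresent(sec -> {
            if (sec > thresholdSec)
                System.out.println("Stopped for " + sec + " s, reaction: " + reaction(policy));
        });
    }
}
```

Like the in-proc thread, a log-based check sees a pause only once the JVM has written the line, and it does not cover stalls that never reach the GC log, such as high CPU utilization from other causes.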
