ignite-dev mailing list archives

From Denis Magda <dma...@apache.org>
Subject Re: Automatic Handling of Long Stop-the-World Pauses
Date Mon, 02 Jul 2018 17:54:21 GMT
Igniters,

Pulling this discussion up. Any thoughts?

--
Denis

On Thu, Jun 21, 2018 at 3:52 PM Denis Magda <dmagda@apache.org> wrote:

> Igniters,
>
> It's a pleasure to see how our project is evolving in the direction of being
> a self-healing solution:
>
>    - Ignite can already handle critical failures such as OOM, File I/O
>    issues, etc. [1]
>    - There is an endeavor to fix cluster lock-ins due to partition map
>    exchange issues. [2]
>
> There is one more notorious problem that might affect Ignite deployments:
> long stop-the-world pauses (STPs) caused by GC.
>
> I know we made some progress in this direction [3] by providing
> particular metrics that help to monitor the pauses. Why don't we keep up the
> pace and teach Ignite to help itself if it sees there is a node that brings
> down overall cluster performance due to an STP?
>
> I would create policies similar to the critical failures policies [4] or
> just add a long STP to the list of critical failures and reuse existing
> functionality.
>
> Thoughts? Anyone who'd like to implement the feature?
>
> [1] https://apacheignite.readme.io/docs/critical-failures-handling
> [2]
> http://apache-ignite-developers.2346864.n4.nabble.com/IEP-25-Partition-Map-Exchange-hangs-resolving-td31819.html
> [3] https://issues.apache.org/jira/browse/IGNITE-6171
> [4]
> https://apacheignite.readme.io/docs/critical-failures-handling#section-failure-handling
>
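
To make the proposal above more concrete, here is a minimal sketch of how a long
pause could be detected and handed to a failure policy. It is an illustration
only, not the IGNITE-6171 implementation and not an existing Ignite API: the
class name StwPauseDetector, its methods, and the LongConsumer callback that
stands in for a failure handler are all hypothetical. The sketch assumes that a
thread waking up much later than its requested sleep interval indicates a
process-wide pause; OS scheduling delays can cause the same overshoot, so the
threshold would need to be generous.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;

/** Hypothetical sketch: detects long pauses by measuring sleep overshoot. */
final class StwPauseDetector implements Runnable {
    private final long intervalMs;          // how long the watcher sleeps per cycle
    private final long thresholdMs;         // overshoot treated as a "long" pause
    private final LongConsumer onLongPause; // stand-in for a failure policy

    StwPauseDetector(long intervalMs, long thresholdMs, LongConsumer onLongPause) {
        this.intervalMs = intervalMs;
        this.thresholdMs = thresholdMs;
        this.onLongPause = onLongPause;
    }

    /** Milliseconds by which a wake-up overshot the requested sleep, never negative. */
    public static long overshootMillis(long startNanos, long endNanos, long intervalMs) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0L, elapsedMs - intervalMs);
    }

    /** Invokes the callback and returns true if the overshoot reaches the threshold. */
    public boolean check(long startNanos, long endNanos) {
        long pauseMs = overshootMillis(startNanos, endNanos, intervalMs);
        if (pauseMs >= thresholdMs) {
            onLongPause.accept(pauseMs);
            return true;
        }
        return false;
    }

    /** Watcher loop: sleep, measure how late the wake-up was, apply the policy. */
    @Override public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop
                return;
            }
            check(start, System.nanoTime());
        }
    }
}
```

The detector could be started on a daemon thread, with the callback deciding
whether to just log the pause or to escalate it in the same way as the critical
failures described in [4]. Keeping the policy behind a callback is one way to
reuse existing handling rather than adding a separate mechanism.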
