ignite-dev mailing list archives

From Alexey Goncharuk <alexey.goncha...@gmail.com>
Subject Re: [IEP-35] Monitoring & Profiling. Phase 2
Date Fri, 28 Jun 2019 16:09:07 GMT
Sorry for the duplicate - apparently, Maksim also spotted this regression.
Let's continue the discussion in a separate thread.

On Fri, 28 Jun 2019 at 19:04, Alexey Goncharuk <alexey.goncharuk@gmail.com> wrote:

> Hello Nikolay,
>
> From the latest TC runs I can see a sporadic regression
> in testParallelStartAndStop and testStartManyCaches tests. Digging deeper,
> I see that stopping a single cache when there are many other started caches
> now takes a significant amount of time.
>
> The current suspect is the GridCacheAdapter#stop method, which iterates over the
> cache metrics and removes them. The issue is that
> MetricsRegistry.withPrefix().getMetrics().forEach() internally uses a
> filtered view of all the metrics, so it effectively iterates over all existing
> metrics in the system, which makes a sequential stop of N caches O(N^2)
> overall.
>
> I think we can either make the metrics registry use a SkipListMap,
> which would allow us to iterate over only the subset of metrics with a given
> prefix, or keep a trie internally so that we can remove all metrics with a
> given prefix in O(1) time.
>
> What do you think?
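
A minimal sketch of the SkipListMap option mentioned above, assuming metric names are dot-separated strings and using java.util.concurrent.ConcurrentSkipListMap. The class PrefixMetricRegistry and its methods are hypothetical and are not Ignite's MetricsRegistry API. Because keys are kept sorted, a prefix maps to a contiguous key range, so stopping one cache touches only that cache's metrics instead of filtering every metric in the system:

```java
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical registry sketch; NOT Ignite's MetricsRegistry. */
class PrefixMetricRegistry {
    /** Metrics sorted by full name, e.g. "cache.orders.puts" (naming is an assumption). */
    private final ConcurrentSkipListMap<String, Long> metrics = new ConcurrentSkipListMap<>();

    void register(String name, long value) {
        metrics.put(name, value);
    }

    /** Live view of the metrics whose name starts with prefix + '.'. */
    ConcurrentNavigableMap<String, Long> withPrefix(String prefix) {
        String from = prefix + ".";
        // '/' is the character immediately after '.', so the half-open range
        // [from, to) contains exactly the keys that start with "prefix.".
        String to = prefix + "/";
        return metrics.subMap(from, true, to, false);
    }

    /** Removes every metric under the prefix and returns how many were removed. */
    int removeByPrefix(String prefix) {
        int removed = 0;
        // Skip-list iterators are weakly consistent, so removing while iterating is safe.
        for (String name : withPrefix(prefix).keySet()) {
            if (metrics.remove(name) != null)
                removed++;
        }
        return removed;
    }

    int size() {
        return metrics.size();
    }
}
```

With this layout, `removeByPrefix("cache.a")` visits only the entries under `cache.a.`; a sibling such as `cache.ab.puts` falls outside the range and is left alone.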
>
> On Mon, 10 Jun 2019 at 13:49, Nikolay Izhikov <nizhikov@apache.org> wrote:
>
>> Hello, Igniters.
>>
>> Since Phase 1 will be merged into master soon, I've created ticket [1]
>> for Phase 2.
>>
>> Scope of Phase 2 (copy-pasted from the ticket):
>>
>> Ability to collect lists of some internal objects that Ignite manages.
>> Examples of such objects:
>>
>>   * Caches
>>   * Queries (including continuous queries)
>>   * Services
>>   * Compute tasks
>>   * Distributed Data Structures
>>   * etc...
>>
>>
>> 1. Fields for each list (that doesn't currently exist in Ignite) will be
>> discussed in separate tickets.
>> 2. Metric Exporters can (optionally) support list export.
>>
>> [1] https://issues.apache.org/jira/browse/IGNITE-11905
>>
>>
>> On Tue, 14/05/2019 at 16:42 +0300, Nikolay Izhikov wrote:
>> > Ticket for IEP.Phase1 created - https://issues.apache.org/jira/browse/IGNITE-11848
>> >
>> >
>> > On Mon, 13/05/2019 at 18:06 +0300, Nikolay Izhikov wrote:
>> > > Hello, Igniters.
>> > >
>> > > We have discussed this IEP [1] with Alexey Goncharuk, Anton Vinogradov,
>> > > Andrey Gura, Alexey Scherbakov and Pavel Kovalenko.
>> > >
>> > > Issues to address:
>> > >
>> > > 1. Study the experience of the following libs and tools:
>> > >     * OpenTracing
>> > >     * OpenCensus
>> > >     * DropWizard
>> > >
>> > > 2. Support a histogram sensor: a sensor that collects values that fall
>> > > into predefined segments.
>> > >
>> > > 3. Use more widely used naming (like in OpenCensus?).
>> > >
>> > > 4. Consider using OpenCensus as the default implementation for
>> > > local metric storage.
>> > >
>> > > 5. Measure the performance penalty of metrics for 5_000 caches.
>> > >
>> > > 6. Some metrics should be part of the public API and others should not
>> > > (they may be changed/removed in a release without warning).
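
The histogram sensor from item 2 could look roughly like the sketch below. The class name, the inclusive upper bounds and the extra overflow bucket are assumptions made for illustration, not the design adopted in the IEP:

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical histogram sensor: counts values per predefined segment. */
class HistogramSensorSketch {
    /** Sorted, inclusive upper bounds of the segments (assumed layout). */
    private final long[] bounds;

    /** One counter per segment plus a last "overflow" counter. */
    private final AtomicLongArray counts;

    HistogramSensorSketch(long... bounds) {
        this.bounds = bounds.clone();
        this.counts = new AtomicLongArray(bounds.length + 1);
    }

    /** Increments the counter of the first segment whose bound is >= value. */
    void record(long value) {
        int idx = 0;
        while (idx < bounds.length && value > bounds[idx])
            idx++;
        counts.incrementAndGet(idx);
    }

    /** Raw counter of a segment; no aggregation is done here. */
    long count(int segment) {
        return counts.get(segment);
    }
}
```

For example, `new HistogramSensorSketch(10, 100)` keeps three counters: values up to 10, values from 11 to 100, and everything above 100.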
>> > >
>> > > My plan for Phase #1 is the following:
>> > >
>> > > 1. Address the issues.
>> > > 2. Prepare the public API.
>> > > 3. Prepare a PR for the monitoring subsystem + existing metrics rewritten
>> > > with it.
>> > > 4. Prepare a PR with lists of each user API.
>> > > 5. Collect feedback on #4.
>> > > 6. Design a log exposer. Consider using the JFR format or some
>> > > other widely used, tool-compatible format.
>> > >
>> > > [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=112820392
>> > >
>> > > On Thu, 02/05/2019 at 14:02 +0300, Nikolay Izhikov wrote:
>> > > > Hello, Maxim.
>> > > >
>> > > > > How will throughput sensor values that require an interval for
>> > > > > the rate calculations be recorded?
>> > > >
>> > > > I answered this question in the "Design principles" of the IEP:
>> > > >
>> > > > ```
>> > > > Sensors should contain only raw values. No aggregation of numeric metrics on Ignite side.
>> > > > Min, max, avg and other functions are the matter of an external monitoring system.
>> > > > ```
>> > > >
>> > > > Throughput is a function `(S(t2) - S(t1))/(t2-t1)`
>> > > > where S(t) is the sensor value in some point of time t.
>> > > >
>> > > > It seems that throughput calculation is the responsibility of an
>> > > > external system.
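
To illustrate the formula above, an external system could derive a rate from two raw samples of a sensor taken at different times. This is only a sketch; the class and method names are hypothetical, and timestamps are assumed to be in milliseconds:

```java
/** Hypothetical helper that lives in the external monitoring system, not in Ignite. */
class ThroughputFromSamples {
    /**
     * Computes (S(t2) - S(t1)) / (t2 - t1), scaled to events per second.
     *
     * @param s1 raw sensor value at time t1
     * @param t1Millis time of the first sample, in milliseconds
     * @param s2 raw sensor value at time t2
     * @param t2Millis time of the second sample, in milliseconds
     */
    static double perSecond(long s1, long t1Millis, long s2, long t2Millis) {
        if (t2Millis <= t1Millis)
            throw new IllegalArgumentException("t2 must be later than t1");

        return (s2 - s1) * 1000.0 / (t2Millis - t1Millis);
    }
}
```

For instance, a counter that grows from 100 to 350 over 5 seconds gives `perSecond(100, 0, 350, 5000)`, i.e. 50 events per second.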
>> > > >
>> > > > What do you think?
>> > > >
>> > > > > It seems to me that we can add an additional parameter of
>> > > > > `sensitivityLevel` to provide for the user a flexible sensor control
>> > > > > (e.g., INFO, WARN, NOTICE, DEBUG).
>> > > >
>> > > > For now, I think that all sensors and lists will be very (very!)
>> > > > lightweight.
>> > > > So, we should be able to disable/enable them, for sure.
>> > > >
>> > > > But we should turn the whole Ignite subsystem off and on
>> > > > for cases where we have strong performance limitations for a
>> > > > particular workload.
>> > > >
>> > > > So, we have two "levels" of monitoring: INFO and DEBUG (for
>> > > > profiling: IEP-35, Phase 3).
>> > > > For example, AFAIK we can't disable the current SQL system views (why
>> > > > should we?).
>> > > >
>> > > > On Tue, 30/04/2019 at 14:33 +0300, Maxim Muzafarov wrote:
>> > > > > Hello Nikolay,
>> > > > >
>> > > > > I've looked through the changes in your PR.
>> > > > >
>> > > > > > Sensors
>> > > > >
>> > > > > How will throughput sensor values that require an interval for
>> > > > > the rate calculations be recorded? Do we have such an example? For
>> > > > > instance, getAllocationRate() or getEvictionRate(). These metrics are
>> > > > > out of the scope of the current PoC and IEP as they are not related to
>> > > > > user metrics, but they are a good example of a particular metric type.
>> > > > >
>> > > > > It seems to me that we can add an additional parameter of
>> > > > > `sensitivityLevel` to provide for the user a flexible sensor control
>> > > > > (e.g., INFO, WARN, NOTICE, DEBUG).
>> > > > >
>> > > > > It also seems that a completely functional Java approach can be
>> > > > > used for the sensors' getValue(). Am I right?
>> > > > >
>> > > > > On Mon, 29 Apr 2019 at 11:44, Nikolay Izhikov <nizhikov@apache.org> wrote:
>> > > > > >
>> > > > > > Hello, Vyacheslav.
>> > > > > >
>> > > > > > Thanks for the feedback!
>> > > > > >
>> > > > > > > HttpExposer with Jetty's dependencies should be detached
>> > > > > > > from the core module.
>> > > > > >
>> > > > > > Agreed. The module hierarchy is the essence of the next steps.
>> > > > > > For now it is just a proof of my ideas for Ignite monitoring that we
>> > > > > > can discuss.
>> > > > > >
>> > > > > > > I like your approach with 'wrapper' for monitored objects,
>> > > > > > > like 'ComputeTaskInfo' in PR, and don't like using
>> > > > > > > 'ServiceConfiguration' directly as a monitored object for services
>> > > > > >
>> > > > > > Agreed in general.
>> > > > > > It seems that choosing the right data to expose is a matter of
>> > > > > > separate discussion for each Ignite entity.
>> > > > > > I've planned to file a ticket for each entity so anyone interested
>> > > > > > can share their vision in it.
>> > > > > >
>> > > > > > > In my opinion, each sensor should have a timestamp.
>> > > > > >
>> > > > > > I'm not sure that *every* sensor should have a directly
>> > > > > > associated timestamp.
>> > > > > > It seems we should support sensors without a timestamp, at least
>> > > > > > for current monitoring numbers.
>> > > > > >
>> > > > > > > Also, it'd be great to have an ability to store a list of a
>> > > > > > > fixed size of last N sensors
>> > > > > >
>> > > > > > What use-cases do you know for such sensors?
>> > > > > > We have plans to support fixed-size lists to show "Last N SQL
>> > > > > > queries" or similar data.
>> > > > > > Essentially, a sensor is just a single value with a name and a
>> > > > > > known meaning.
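
A fixed-size "last N" list as mentioned above might be sketched as a bounded deque that evicts the oldest entry when full. The class name LastNList is hypothetical and is not taken from the PR:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical bounded list that keeps only the most recent N items. */
class LastNList<T> {
    private final int capacity;

    /** Oldest item at the head, newest at the tail. */
    private final ArrayDeque<T> items = new ArrayDeque<>();

    LastNList(int capacity) {
        if (capacity <= 0)
            throw new IllegalArgumentException("capacity must be positive");

        this.capacity = capacity;
    }

    /** Adds an item, dropping the oldest one if the list is already full. */
    synchronized void add(T item) {
        if (items.size() == capacity)
            items.removeFirst();

        items.addLast(item);
    }

    /** Returns a copy of the current content, oldest first. */
    synchronized List<T> snapshot() {
        return new ArrayList<>(items);
    }
}
```

Adding "q1", "q2" and "q3" to a `LastNList<String>` of capacity 2 leaves only "q2" and "q3" in the snapshot.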
>> > > > > >
>> > > > > > > It'd be great if you provide a more extended test to show the
>> > > > > > > work of the system.
>> > > > > >
>> > > > > > Sorry for that :)
>> > > > > > When you run 'MonitoringSelfTest' you should open
>> > > > > > http://localhost:8080/ignite/monitoring to view the exposed info.
>> > > > > > I provided this info in a gist -
>> > > > > > https://gist.github.com/nizhikov/aa1e6222e6a3456472b881b8deb0e24d
>> > > > > >
>> > > > > > I will extend this test to print results to the console in the next
>> > > > > > iterations - stay tuned :)
>> > > > > >
>> > > > > > On Sun, 28/04/2019 at 23:35 +0300, Vyacheslav Daradur wrote:
>> > > > > > > Hi, Nikolay,
>> > > > > > >
>> > > > > > > I looked through the PR and IEP, and I have some comments:
>> > > > > > >
>> > > > > > > It would be better to implement it as a separate module. I can't say
>> > > > > > > if it is possible for the main part of monitoring or not, but I
>> > > > > > > believe that HttpExposer with Jetty's dependencies should be detached
>> > > > > > > from the core module.
>> > > > > > >
>> > > > > > > I like your approach with 'wrapper' for monitored objects, like
>> > > > > > > 'ComputeTaskInfo' in PR, and don't like using 'ServiceConfiguration'
>> > > > > > > directly as a monitored object for services. I believe we shouldn't
>> > > > > > > mix approaches. It'd be better to always use some kind of container
>> > > > > > > with the monitored object's information to work with such data.
>> > > > > > >
>> > > > > > > In my opinion, each sensor should have a timestamp. Usually monitoring
>> > > > > > > systems aggregate data and build graphs according to sensor
>> > > > > > > timestamps.
>> > > > > > >
>> > > > > > > Also, it'd be great to have an ability to store a list of a fixed size
>> > > > > > > of last N sensors, not to miss them without pushing to an external
>> > > > > > > monitoring system.
>> > > > > > >
>> > > > > > > It'd be great if you provide a more extended test to show the work of
>> > > > > > > the system. Everybody who looks at the PR needs to run the test and get
>> > > > > > > the info manually to see the completeness of sensors; this might be
>> > > > > > > simplified by a proper test.
>> > > > > > >
>> > > > > > > Thank you!
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > On Fri, Apr 26, 2019 at 5:56 PM Nikolay Izhikov <nizhikov@apache.org> wrote:
>> > > > > > > >
>> > > > > > > > Hello, Igniters.
>> > > > > > > >
>> > > > > > > > I've prepared a Proof of Concept for IEP-35 [1].
>> > > > > > > > The PR can be found here - https://github.com/apache/ignite/pull/6510
>> > > > > > > >
>> > > > > > > > I've made the following changes:
>> > > > > > > >
>> > > > > > > >         1. `GridMonitoringManager` [2] - a simple implementation of
>> > > > > > > >            a manager to store all monitoring info.
>> > > > > > > >         2. `HttpPullExposerSpi` [3] - a pull exposer implementation that
>> > > > > > > >            can respond with JSON from http://localhost:8080/ignite/monitoring.
>> > > > > > > >            The JSON content can be viewed in the gist [4].
>> > > > > > > >         3. Compute task start and finish are monitored in the "compute" list [5].
>> > > > > > > >         4. Service registrations are monitored in the "service" list [6].
>> > > > > > > >         5. The current `IgniteSpiMBeanAdapter` is rewritten using
>> > > > > > > >            `GridMonitoringManager` [7].
>> > > > > > > >
>> > > > > > > > Design principles, monitoring subsystem details and new Ignite
>> > > > > > > > entities can be found in the IEP [1].
>> > > > > > > >
>> > > > > > > > My next steps will be:
>> > > > > > > >
>> > > > > > > >         1. Implementation of a JMX exposer.
>> > > > > > > >         2. Registration of all "lists" and "sensor groups" as SQL system views.
>> > > > > > > >         3. Add monitoring for all unmonitored Ignite APIs (described in the IEP).
>> > > > > > > >         4. Rewrite existing JMX metrics using GridMonitoringManager.
>> > > > > > > >
>> > > > > > > > Please share your thoughts.
>> > > > > > > >
>> > > > > > > > Part of JSON file:
>> > > > > > > > ```
>> > > > > > > >     "COMPUTE": {
>> > > > > > > >       "tasks": {
>> > > > > > > >         "name": "tasks",
>> > > > > > > >         "rows": [
>> > > > > > > >           {
>> > > > > > > >             "id": "0798817a-eeec-4386-9af7-94edb39ffced",
>> > > > > > > >             "sessionId": "a1814f95a61-912451ff-ca7b-4764-a7fd-728f6a900000",
>> > > > > > > >             "data": {
>> > > > > > > >               "taskClasName": "org.apache.ignite.monitoring.MonitoringSelfTest$$Lambda$145/1500885480",
>> > > > > > > >               "startTime": 1556287337944,
>> > > > > > > >               "timeout": 9223372036854776000,
>> > > > > > > >               "execName": null
>> > > > > > > >             },
>> > > > > > > >             "name": "anotherBroadcast"
>> > > > > > > >           }
>> > > > > > > > ```
>> > > > > > > >
>> > > > > > > > [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=112820392
>> > > > > > > > [2] https://github.com/apache/ignite/pull/6510/files#diff-ec7d5cf5e35b99303deb9accee153c50R34
>> > > > > > > > [3] https://github.com/apache/ignite/pull/6510/files#diff-32239c45e0ae3b692af2eae7078e1436R47
>> > > > > > > > [4] https://gist.github.com/nizhikov/aa1e6222e6a3456472b881b8deb0e24d
>> > > > > > > > [5] https://github.com/apache/ignite/pull/6510/files#diff-d651ed29d07bd0c5ce291654a3254cc0R749
>> > > > > > > > [6] https://github.com/apache/ignite/pull/6510/files#diff-0b4e54fbda2b0da1c10eff48416336f6R1606
>> > > > > > > > [7] https://github.com/apache/ignite/pull/6510/files#diff-4398bf118150500e059069b3a1638ec7R61
>> > > > > > >
>> > > > > > >
>> > > > > > >
>>
>
