lucene-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-13579) Create resource management API
Date Thu, 25 Jul 2019 11:59:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-13579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892670#comment-16892670 ]

Andrzej Bialecki  commented on SOLR-13579:
------------------------------------------

The main scenario that prompted this development was a need to control the aggregated cache
sizes across all cores in a CoreContainer in a multi-tenant (uncooperative) situation. However,
it seemed like a similar approach would be applicable for controlling other runtime usage
of resources in a Solr node - hence the attempt to come up with a generic framework.

A particular component may support resource management of several of its aspects. E.g. a {{SolrIndexSearcher}}
can have a "cache" RAM usage aspect, a "mergeIO" throttling aspect, a "mergeThreadCount" aspect,
a "queryThreadCount" aspect, etc. Each of these aspects can be managed by a different global
pool that defines the total resource limits of a given type. Currently a component can be registered
in only a single pool of a given type, in order to avoid conflicting instructions.
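
For illustration only, a minimal sketch of what such a per-aspect management contract could look like - all names below ({{ManagedResource}}, {{getManagedLimits}}, etc.) are hypothetical and may not match the attached patch:

{code:java}
import java.util.Collection;
import java.util.Map;

// Hypothetical sketch only - the interface and method names are illustrative,
// not necessarily the API used in the attached patch.
public interface ManagedResource {
  /** Unique name of this component, e.g. core name + cache name. */
  String getResourceName();

  /** Current values of monitored parameters, e.g. ramBytesUsed. */
  Map<String, Object> getMonitoredValues(Collection<String> params);

  /** Current controllable limits, e.g. maxRamMB. */
  Map<String, Object> getManagedLimits();

  /** Called by the pool's plugin to adjust limits, e.g. to reduce maxRamMB. */
  void setManagedLimits(Map<String, Object> newLimits);
}

// Registration (hypothetical call): a searcher cache joins the global pool
// that manages its "cache" aspect.
// resourceManager.registerComponent("searcherCachesPool", filterCache);
{code}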

In the current patch the component registration and pool creation parts are primitive - the
default pools are created statically and components are forced to register in a dedicated
pool. In the future this could be configurable - e.g. components from cores belonging to different
collections may belong to different pools with different limits / priorities.

In the following stories, there are always two aspects of resource management - control and
optimization. The control aspect ensures that the specified hard limits are observed, while
the optimization aspect ensures that each component uses resources in an optimal way. The
focus of this JIRA issue is mainly on the control aspect, with optimization to follow later.

h2. Story 1: controlling global cache RAM usage in a Solr node
{{SolrIndexSearcher}} caches are currently configured statically, using either item count
limits or {{maxRamMB}} limits. We can only specify these limits per cache, and then limit
the number of cores in a node to arrive at a hard total upper limit.

However, this is not enough, because it means the heap must be provisioned for the upper limit
even when the actual consumption by caches may be far lower. It would be nice for a more active
core to be able to use more heap for caches than another core with less traffic, while ensuring
that total heap usage never exceeds a given threshold (the optimization aspect). It is also
required that the total heap usage of caches never exceeds the maximum threshold, to ensure
proper behavior of a Solr node (the control aspect).

In order to do this we need a control mechanism that is able to adjust individual cache sizes
per core, based on the total hard limit and the actual current "need" of a core, defined as
a combination of hit ratio, QPS, and other arbitrary quality factors / SLA. This control mechanism
also needs to be able to forcibly reduce excessive usage (evenly? prioritized by collection's
SLA?) when the aggregated heap usage exceeds the threshold.

In terms of the proposed API this scenario would work as follows:
 * a global resource pool "searcherCachesPool" is created with a single hard limit on, e.g.,
the total {{maxRamMB}}.
 * this pool knows how to manage components of a "cache" type - what parameters to monitor
and what parameters to use in order to control their resource usage. This logic is encapsulated
in {{CacheManagerPlugin}}.
 * all searcher caches from all cores register themselves in this pool for the purpose of
managing their "cache" aspect.
 * the plugin is executed periodically to check the current resource usage of all registered
caches, using e.g. the aggregated value of {{ramBytesUsed}}.
 * if this aggregated value exceeds the total {{maxRamMB}} limit configured for the pool, then
the plugin adjusts the {{maxRamMB}} setting of each cache in order to reduce the total RAM
consumption - currently this uses a simple proportional formula without any history (the P
part of PID), with a dead-band in order to avoid thrashing (see the sketch after this list).
Also, for now, this addresses only the control aspect (exceeding a hard threshold) and not
the optimization aspect, i.e. it doesn't proactively reduce / increase {{maxRamMB}} based on hit rate.
 * as a result of this action some of the cache content will be evicted sooner and more aggressively
than initially configured, thus freeing more RAM.
 * when the memory pressure decreases the {{CacheManagerPlugin}} re-adjusts the {{maxRamMB}}
settings of each cache to the initially configured values. Again, the current implementation
of this algorithm is very simple but can be easily improved because it's cleanly separated
from implementation details of each cache.
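
To make the control loop concrete, here's a minimal sketch of the proportional adjustment with a dead-band described above - class and method names are illustrative only, not the patch's actual API:

{code:java}
import java.util.List;

// Illustrative sketch of a P-only adjustment with a dead-band, as described above.
// The names (CacheManagerPluginSketch, ManagedCache, etc.) are hypothetical.
public class CacheManagerPluginSketch {
  private static final double DEAD_BAND = 0.1; // ignore deviations within 10% of the limit

  /** Periodically invoked with all caches registered in the pool. */
  public void manage(List<ManagedCache> caches, long poolMaxRamMB) {
    long totalRamMB = caches.stream().mapToLong(ManagedCache::ramMBUsed).sum();
    double deviation = (double) (totalRamMB - poolMaxRamMB) / poolMaxRamMB;
    if (Math.abs(deviation) < DEAD_BAND) {
      return; // within the dead-band: do nothing to avoid thrashing
    }
    if (totalRamMB > poolMaxRamMB) {
      // Proportional reduction: scale each cache's maxRamMB by limit / usage.
      double factor = (double) poolMaxRamMB / totalRamMB;
      for (ManagedCache cache : caches) {
        cache.setMaxRamMB((int) (cache.getMaxRamMB() * factor));
      }
    } else {
      // Memory pressure decreased: restore the originally configured limits.
      for (ManagedCache cache : caches) {
        cache.setMaxRamMB(cache.getConfiguredMaxRamMB());
      }
    }
  }

  /** Minimal view of a managed cache, assumed for this sketch. */
  public interface ManagedCache {
    long ramMBUsed();
    int getMaxRamMB();
    int getConfiguredMaxRamMB();
    void setMaxRamMB(int maxRamMB);
  }
}
{code}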

h2. Story 2: controlling global IO usage in a Solr node

As in the scenario above, we can currently only statically configure merge throttling
({{RateLimiter}}) per core, but we can't monitor and control the total IO rates across all cores,
which may easily lead to QoS degradation of other cores due to excessive merge rates of a
particular core.

Although {{RateLimiter}} parameters can be dynamically adjusted, this functionality is not
exposed, and there's no global control mechanism to ensure "fairness" of allocation of available
IO (which is limited) between competing cores.

In terms of the proposed API this scenario would work as follows:
 * a global resource pool "mergeIOPool" is created with a single hard limit {{maxMBPerSec}},
picked as a fraction of the available hardware IO capacity that still provides acceptable
performance.
 * this pool knows how to manage components of a "mergeIO" type. It monitors their current
resource usage (using {{SolrIndexWriter}} metrics) and knows how to adjust each core's {{ioThrottle}}.
This logic is encapsulated in {{MergeIOManagerPlugin}} (doesn't exist yet).
 * all {{SolrIndexWriter}}-s in all cores register themselves in this pool for the purpose
of managing their "mergeIO" aspect.

The rest of the scenario is similar to Story 1. As a result of the plugin's adjustments,
the merge IO rate of some of the cores may be decreased / increased according to the total
IO available in the pool.
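
For illustration, a minimal sketch of how the same control loop could look for the "mergeIO" aspect - since {{MergeIOManagerPlugin}} doesn't exist yet, all names here are hypothetical:

{code:java}
import java.util.List;

// Hypothetical sketch only: MergeIOManagerPlugin does not exist yet, and the
// names below (ManagedWriter, setIoThrottle, etc.) are illustrative.
public class MergeIOManagerPluginSketch {

  /** Periodically invoked with all index writers registered in the pool. */
  public void manage(List<ManagedWriter> writers, double poolMaxMBPerSec) {
    double totalMBPerSec =
        writers.stream().mapToDouble(ManagedWriter::currentMergeMBPerSec).sum();
    if (totalMBPerSec > poolMaxMBPerSec) {
      // Reduce each writer's merge IO throttle proportionally to fit the pool limit.
      double factor = poolMaxMBPerSec / totalMBPerSec;
      for (ManagedWriter w : writers) {
        w.setIoThrottle(w.getIoThrottle() * factor);
      }
    } else {
      // Below the limit: restore the originally configured throttle values.
      for (ManagedWriter w : writers) {
        w.setIoThrottle(w.getConfiguredIoThrottle());
      }
    }
  }

  /** Minimal view of a managed SolrIndexWriter, assumed for this sketch. */
  public interface ManagedWriter {
    double currentMergeMBPerSec();
    double getIoThrottle();
    double getConfiguredIoThrottle();
    void setIoThrottle(double mbPerSec);
  }
}
{code}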

> Create resource management API
> ------------------------------
>
>                 Key: SOLR-13579
>                 URL: https://issues.apache.org/jira/browse/SOLR-13579
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>            Priority: Major
>         Attachments: SOLR-13579.patch, SOLR-13579.patch, SOLR-13579.patch, SOLR-13579.patch, SOLR-13579.patch
>
>
> Resource management framework API supporting the goals outlined in SOLR-13578.


