jackrabbit-oak-issues mailing list archives

From Michael Dürig (JIRA) <j...@apache.org>
Subject [jira] [Updated] (OAK-5468) Ease TarMK Operations
Date Mon, 13 Feb 2017 11:02:41 GMT

     [ https://issues.apache.org/jira/browse/OAK-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Dürig updated OAK-5468:
-------------------------------
    Description: 
h2. Ease of TarMK Operations

This epic is all about simplifying the operational aspects of the TarMK. Broadly this can
be broken down into the following three topics.

h3. Monitoring
* We need to improve monitoring for system load and health. It should be easy for operators
to figure out which parts of the TarMK are within safe bounds and which are not.
* Failures should be easy to diagnose and their root cause easy to pinpoint. It should be
evident if and how a failure can be fixed by the operator.

h3. Management
* Management tasks should be easy to use, clear and safe. It should be evident how to achieve
a certain task, what it means to execute it and what its parameters mean (discoverability).
Executing a task should not cause harm to the system just because the system is not in the
right state (e.g. running restore concurrently with backup should be safe).

h3. Tooling
* We need better tooling for diagnosing systems, e.g. analysis of file stores (what content,
how much content, distribution over space and time, reachability, retention time, garbage,
etc.), both online and offline (i.e. post mortem).


h2. Individual improvements

Below is a list of items to address in no specific order. Let's start extracting them into
individual issues linked to this epic as we start tackling this. 

h3. Monitoring
* Throughput (e.g. time to commit, time to save, etc.)
* Thrashing (onset thereof)
* SNFE (transient vs. catastrophic)
* DSGC
* FileStore (e.g. size on disk, #tar files, #segments, #nodes, #properties, etc.)
* Cold standby (progress, liveness, latency, etc.)
* ...

h3. Management
* Revisit backup/restore (OAK-5103, OAK-4866)
* Coordination of management operations (ability to run conditionally, prevent them from running
concurrently, etc.)
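
As a rough illustration of the coordination item above, here is a minimal sketch of a guard that runs a management operation only if a precondition holds and no other operation is active. {{OperationCoordinator}} and {{tryRun}} are hypothetical names made up for this sketch and are not part of Oak. It only covers operations inside a single JVM.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BooleanSupplier;

/** Hypothetical coordinator, not an Oak API: at most one management operation runs at a time. */
final class OperationCoordinator {
    // Name of the operation currently running, or null when idle.
    private final AtomicReference<String> running = new AtomicReference<>();

    /**
     * Runs the task only if no other operation is active and the precondition holds.
     * Returns an empty Optional when the task ran, otherwise the reason it did not run.
     */
    Optional<String> tryRun(String name, BooleanSupplier precondition, Runnable task) {
        // Atomically claim the slot; a non-null witness value is the operation holding it.
        String current = running.compareAndExchange(null, name);
        if (current != null) {
            return Optional.of("refused: " + current + " is running");
        }
        try {
            if (!precondition.getAsBoolean()) {
                return Optional.of("skipped: precondition not met");
            }
            task.run();
            return Optional.empty();
        } finally {
            // Release the slot even if the precondition or the task throws.
            running.set(null);
        }
    }
}
```

With such a guard a restore requested while a backup is in progress would be refused with a clear reason instead of interfering with the backup. Coordinating across processes (e.g. a running instance and an offline tool) would need a different mechanism.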

h3. Tooling
* Progress monitor for {{oak-run compact}}
* Crash recovery for {{oak-run compact}} (e.g. run cleanup only to remove garbage left by
a prior crash)
* Bring {{oak-run check}} up to date. Address scalability and performance issues. Include
more useful statistics (e.g. node types, child node lists, content distribution, etc.)
* Changes over time
* Consolidation of various (unversioned) scripts, like the 'node count script' and the 'node
remove script', into oak-run.
* Allow connecting tools to a running instance.
* Snapshotting support: restartable stats collection (snapshot at certain revision, diff to
collect extras)
* "Friendly" output formats that can be easily used by other tools (e.g. Unix tools, Kibana,
etc.)
* Proper usage of stdin and stdout
* Proper exit codes
* A current gap in tooling is around healing a repository plagued with SNFEs: bridge the
gap between {{oak-run check}} and the 'oak console node count script', and provide options
to plug the holes to restore the repository to a consistent state. One idea would be to complement
rolling back the segment store to the last good revision with rolling it forward to a new,
fixed good revision. The simplest way of fixing is to just replace unreadable items with
empty ones (i.e. "plugging the holes"). From there one could diff this new fixed revision
against the last good revision to assess the damage and see what else needs fixing (e.g. to
regain consistency wrt. JCR).
* Classification of tools between development / research / experimental and production (customer
facing). The latter need a different level of support, maintenance, QE, documentation etc.
Possibly mark via documentation which is which. 
* Group commands from oak-run in namespaces. Assign a different namespace to each persistence
implementation in Oak. Let every implementation parse its own commands. Move commands closer
to their implementation and relieve oak-run from code bloat. See OAK-5437 for further details.
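
To make the "plugging the holes" idea from the list above more concrete, here is a toy sketch. It assumes a revision can be modelled as a map from path to content, with an empty {{Optional}} standing in for an unreadable item. This is not how the segment store represents revisions, and {{PlugHoles}}, {{plug}} and {{diff}} are hypothetical names.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Toy model, not Oak's segment store: a revision maps a path to its content. */
final class PlugHoles {
    /** Rolls forward to a fixed revision: unreadable items (empty Optional) become empty content. */
    static Map<String, String> plug(Map<String, Optional<String>> damaged) {
        Map<String, String> fixed = new TreeMap<>();
        damaged.forEach((path, content) -> fixed.put(path, content.orElse("")));
        return fixed;
    }

    /** Diffs the fixed revision against the last good revision to assess the damage. */
    static Map<String, String> diff(Map<String, String> lastGood, Map<String, String> fixed) {
        Map<String, String> changes = new TreeMap<>();
        lastGood.forEach((path, content) -> {
            if (!fixed.containsKey(path)) {
                changes.put(path, "removed");
            } else if (!content.equals(fixed.get(path))) {
                changes.put(path, "changed");
            }
        });
        fixed.keySet().forEach(path -> {
            if (!lastGood.containsKey(path)) {
                changes.put(path, "added");
            }
        });
        return changes;
    }
}
```

In this model the diff lists both the plugged items and any content written after the last good revision, which is the information needed to decide what else needs fixing. A plugged item whose last good content was already empty would not show up, one of several simplifications of the toy model.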



  was:
h2. Ease of TarMK Operations

This epic is all about simplifying the operational aspects of the TarMK. Broadly this can
be broken down into the following three topics.

h3. Monitoring
* We need to improve monitoring for system load and health. It should be easy for operators
to figure out which parts of the TarMK are within safe bounds and which are not.
* Failures should be easy to diagnose and their root cause easy to pinpoint. It should be
evident if and how a failure can be fixed by the operator.

h3. Management
* Management tasks should be easy to use, clear and safe. It should be evident how to achieve
a certain task, what it means to execute it and what its parameters mean (discoverability).
Executing a task should not cause harm to the system just because the system is not in the
right state (e.g. running restore concurrently with backup should be safe).

h3. Tooling
* We need better tooling for diagnosing systems, e.g. analysis of file stores (what content,
how much content, distribution over space and time, reachability, retention time, garbage,
etc.), both online and offline (i.e. post mortem).


h2. Individual improvements

Below is a list of items to address in no specific order. Let's start extracting them into
individual issues linked to this epic as we start tackling this. 

h3. Monitoring
* Throughput (e.g. time to commit, time to save, etc.)
* Thrashing (onset thereof)
* SNFE (transient vs. catastrophic)
* DSGC
* FileStore (e.g. size on disk, #tar files, #segments, #nodes, #properties, etc.)
* Cold standby (progress, liveness, latency, etc.)
* ...

h3. Management
* Revisit backup/restore (OAK-5103, OAK-4866)
* Coordination of management operations (ability to run conditionally, prevent them from running
concurrently, etc.)

h3. Tooling
* Progress monitor for {{oak-run compact}}
* Crash recovery for {{oak-run compact}} (e.g. run cleanup only to remove garbage left by
a prior crash)
* Bring {{oak-run check}} up to date. Address scalability and performance issues. Include
more useful statistics (e.g. node types, child node lists, content distribution, etc.)
* Changes over time
* Consolidation of various (unversioned) scripts, like the 'node count script' and the 'node
remove script', into oak-run.
* Allow connecting tools to a running instance.
* Snapshotting support: restartable stats collection (snapshot at certain revision, diff to
collect extras)
* "Friendly" output formats that can be easily used by other tools (e.g. Unix tools, Kibana,
etc.)
* Proper usage of stdin and stdout
* Proper exit codes
* A current gap in tooling is around healing a repository plagued with SNFEs: bridge the
gap between {{oak-run check}} and the 'oak console node count script', and provide options
to plug the holes, so AEM is usable. One idea would be to complement rolling back the segment
store to the last good revision with rolling it forward to a new, fixed good revision.
The simplest way of fixing is to just replace unreadable items with empty ones (i.e. "plugging
the holes"). From there one could diff this new fixed revision against the last good revision
to assess the damage and see what else needs fixing (e.g. to regain consistency wrt. JCR).
* Classification of tools between development / research / experimental and production (customer
facing). The latter need a different level of support, maintenance, QE, documentation etc.
Possibly mark via documentation which is which. 
* Group commands from oak-run in namespaces. Assign a different namespace to each persistence
implementation in Oak. Let every implementation parse its own commands. Move commands closer
to their implementation and relieve oak-run from code bloat. See OAK-5437 for further details.




> Ease TarMK Operations
> ---------------------
>
>                 Key: OAK-5468
>                 URL: https://issues.apache.org/jira/browse/OAK-5468
>             Project: Jackrabbit Oak
>          Issue Type: Epic
>          Components: segment-tar
>            Reporter: Michael Dürig
>              Labels: management, monitoring, operations, tooling
>             Fix For: 1.8
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
