jackrabbit-oak-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Julian Reschke (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (OAK-4780) VersionGarbageCollector should be able to run incrementally
Date Mon, 20 Feb 2017 15:50:44 GMT

    [ https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15874699#comment-15874699
] 

Julian Reschke edited comment on OAK-4780 at 2/20/17 3:50 PM:
--------------------------------------------------------------

Assuming there is the following in place:
1. Time/Revision of last successful run (as patch proposed in OAK-3070)
2. Reset of _deletedOnce flags for nodes that have become visible/valid again (OAK-5704)
3. remove leaf nodes early (implemented via OAK-5571)

An incremental strategy needs to adapt to various situations:

Case A: the RGC starts and finds the Timestamp(TS)/Revision from the last successful run
 * A.1: the TS is from last night which should be the 99% case
 * A.2: the TS is from several days/weeks/months before. Possible causes:
   - a) the cluster was down for that time period
   - b) the cluster was down every time the RGC was supposed to run (e.g. other system
        maintenance colliding with daily maintenance window)
   - c) the cluster is in a state where the RGC never finishes successfully
   - d) a backup of the database was restored  

Case B: the RGC starts and does not find any data from a previous run. That means
 * B.1: its a new installation and it runs the first time
 * B.2: its an updated installation where an earlier version of the RGC
   - a) has run successfully before
   - b) never did run (successfully)

The question is when to apply which strategy to make RGC reach the stable case A.1 again.

Assuming a common customer is one who has his system running 24/7 (more or less) and
has configured a "maintenance window" periodically (every day/night or every weekend) 
where tasks as RGC are supposed to run. During this time interval, load on the system 
by RGC is acceptable. On other times not.

The degenerate case is where the system produces so much to clean up that the maintenance
time interval is insufficient to perform the necessary tasks. The installation needs
either a larger maintenance window or more CPU/database power or another RGC strategy
than the one discussed here. --> TODO

Let's focus on how to do the task under the assumption that it is possible to do A.1 in
the time given, plus a safety margin of XX% for fluctuations. And that XX% needs to be 
large enough to allow the system recovery from the other scenarios.

Example: the maintenance window has a safety margin of 100%, meaning that its duration
is twice as large as the average work that needs to be done by the RGC after a
maintenance period (let's assume daily).

100% margin means that the RGC can do 2 days of work during on maintenance run. Looking
at the scenarios above, this means in
 * A.1: RGC is finished in half the time on average
 * A.2.a: RGC will catch up all work, next run will find A.1 case
 * A.2.b: RGC is able to catch up one day per run. If it did not run for N days, 
          it will take N-1 days to reach A.1
 * A.2.c: panic mode, red alert
 * A.2.d: will reach A.1 on the next run
 * B.1: will reach A.1 on the next run
 * B.2.a: could reach A.1, question is how
 * B.2.b: will need several maintenance runs to catch up to A.1, depending on amount 
          of nodes to clean up, but how exactly?

And then there is the case where the 100% margin on maintenance time is only valid on 
*average* and that there can be more cleanup work after a busy day. It could be more
work than what is possible to do in one maintenance run.


Folding A and B cases:

If the TS information is missing, the RGC should select the oldest (_modified) node 
with _deletedOnce and use that as TS (maybe an hour before that).

TODO: find out the cost of such a query on our large scale test clusters

With A/B folded, RGC start with a last TS and a MaxRevisionAge setting that gives
a time duration TD. In the normal case A.1, TD < 24 hours (or whatever
the maintenance interval is). In the abnormal cases, this can be several times
as long.

Assuming the maintenance time is configured correctly, the chunks that can be safely
tackled by RGC are 1 maintenance interval. For ease of description, let's assume this
to be a day. So, we have
{code}
   MP := Maintenance Period (e.g. 24 hours)
   MD := Allowed Maintenance Duration (e.g. 4 hours)
   MStart := time RGC started 
   MEnd := MStart + MD
   LMax := max duration of run (0 at start)

   while Now() < MEnd - LMax:
	TS := as persisted or oldest _deletedOnce node
	TD := MStart - MaxRevisionAge - TS
	TM := Min( TD, MP )
	collectAndInspectNopes({ _deletedOnce: 1, TS < _modified < ( TS + TM ) })
	save TS := TS + TM
	LMax := Max( LMax, duration of this run)
{code}
This tries to cleanup nodes in the time interval of 1 Maintenance period, repeatedly,
until the time for maintenance runs out (by taking the max time of each iteration 
into account).

Example: the daily maintenance has 2 hours, the last run was 3 days ago. In the first
iteration, the RGC would collect nodes deleted and modified during time [now-3, now-2[.
That run takes 45 minutes. So, there is time during this maintenance to run a 2nd time,
cleaning up the interval [now-2, now-1[. This takes 40 minutes. 1 hour and 25 minutes
have passed, leaving 35 minutes in maintenance which seems not enough for another run.

During the next night, the RGC will then clean up intervals [now-2, now-1[ and 
[now-1, now[ and catch up.


This strategy has the following weaknesses:
1. if the amount of work for a particular interval is very large (fluctuations in delete)
   the cleanup will not finish before the maintenance interval expires.
2. if the first RGC iteration takes more than 50% of the maintenance duration, a second
   iteration will never be done. This means the RGC will only clean up one maintenance 
   period. If it is behind (A.2,B), it will therefore never catch up.

It seems therefore desirable to take smaller intervals than the maintenance period. How
small depends on the cost of scanning the database for all nodes matching the RGC
criteria. E.g. what we should measure is:

TODO:
 S1: Scan a large database for matching nodes in interval MI
 S2: Scan for intervals MI/2, 2 times
 S5: MI/5, 5 times
 S10: MI/10, 10 times

which should give some numbers to see what divisor is a good choice.

TODO: collecting the nodes happens with iteration over the result sets, so RGC can
detect that time intervals need to be adjusted. Intervals should 
 * shrink when "too many" candidates are detected in one iteration
 * grow when an iteration uses unexpectedly little time

 
















was (Author: stefan.eissing):
Assuming there is the following in place:
1. Time/Revision of last successful run (as patch proposed in OAK-3070)
2. Reset of _deletedOnce flags for nodes that have become visible/valid again (not ticket)
3. remove leaf nodes early (implemented via OAK-5571)

An incremental strategy needs to adapt to various situations:

Case A: the RGC starts and finds the Timestamp(TS)/Revision from the last successful run
 * A.1: the TS is from last night which should be the 99% case
 * A.2: the TS is from several days/weeks/months before. Possible causes:
   - a) the cluster was down for that time period
   - b) the cluster was down every time the RGC was supposed to run (e.g. other system
        maintenance colliding with daily maintenance window)
   - c) the cluster is in a state where the RGC never finishes successfully
   - d) a backup of the database was restored  

Case B: the RGC starts and does not find any data from a previous run. That means
 * B.1: its a new installation and it runs the first time
 * B.2: its an updated installation where an earlier version of the RGC
   - a) has run successfully before
   - b) never did run (successfully)

The question is when to apply which strategy to make RGC reach the stable case A.1 again.

Assuming a common customer is one who has his system running 24/7 (more or less) and
has configured a "maintenance window" periodically (every day/night or every weekend) 
where tasks as RGC are supposed to run. During this time interval, load on the system 
by RGC is acceptable. On other times not.

The degenerate case is where the system produces so much to clean up that the maintenance
time interval is insufficient to perform the necessary tasks. The installation needs
either a larger maintenance window or more CPU/database power or another RGC strategy
than the one discussed here. --> TODO

Let's focus on how to do the task under the assumption that it is possible to do A.1 in
the time given, plus a safety margin of XX% for fluctuations. And that XX% needs to be 
large enough to allow the system recovery from the other scenarios.

Example: the maintenance window has a safety margin of 100%, meaning that its duration
is twice as large as the average work that needs to be done by the RGC after a
maintenance period (let's assume daily).

100% margin means that the RGC can do 2 days of work during on maintenance run. Looking
at the scenarios above, this means in
 * A.1: RGC is finished in half the time on average
 * A.2.a: RGC will catch up all work, next run will find A.1 case
 * A.2.b: RGC is able to catch up one day per run. If it did not run for N days, 
          it will take N-1 days to reach A.1
 * A.2.c: panic mode, red alert
 * A.2.d: will reach A.1 on the next run
 * B.1: will reach A.1 on the next run
 * B.2.a: could reach A.1, question is how
 * B.2.b: will need several maintenance runs to catch up to A.1, depending on amount 
          of nodes to clean up, but how exactly?

And then there is the case where the 100% margin on maintenance time is only valid on 
*average* and that there can be more cleanup work after a busy day. It could be more
work than what is possible to do in one maintenance run.


Folding A and B cases:

If the TS information is missing, the RGC should select the oldest (_modified) node 
with _deletedOnce and use that as TS (maybe an hour before that).

TODO: find out the cost of such a query on our large scale test clusters

With A/B folded, RGC start with a last TS and a MaxRevisionAge setting that gives
a time duration TD. In the normal case A.1, TD < 24 hours (or whatever
the maintenance interval is). In the abnormal cases, this can be several times
as long.

Assuming the maintenance time is configured correctly, the chunks that can be safely
tackled by RGC are 1 maintenance interval. For ease of description, let's assume this
to be a day. So, we have
{code}
   MP := Maintenance Period (e.g. 24 hours)
   MD := Allowed Maintenance Duration (e.g. 4 hours)
   MStart := time RGC started 
   MEnd := MStart + MD
   LMax := max duration of run (0 at start)

   while Now() < MEnd - LMax:
	TS := as persisted or oldest _deletedOnce node
	TD := MStart - MaxRevisionAge - TS
	TM := Min( TD, MP )
	collectAndInspectNopes({ _deletedOnce: 1, TS < _modified < ( TS + TM ) })
	save TS := TS + TM
	LMax := Max( LMax, duration of this run)
{code}
This tries to cleanup nodes in the time interval of 1 Maintenance period, repeatedly,
until the time for maintenance runs out (by taking the max time of each iteration 
into account).

Example: the daily maintenance has 2 hours, the last run was 3 days ago. In the first
iteration, the RGC would collect nodes deleted and modified during time [now-3, now-2[.
That run takes 45 minutes. So, there is time during this maintenance to run a 2nd time,
cleaning up the interval [now-2, now-1[. This takes 40 minutes. 1 hour and 25 minutes
have passed, leaving 35 minutes in maintenance which seems not enough for another run.

During the next night, the RGC will then clean up intervals [now-2, now-1[ and 
[now-1, now[ and catch up.


This strategy has the following weaknesses:
1. if the amount of work for a particular interval is very large (fluctuations in delete)
   the cleanup will not finish before the maintenance interval expires.
2. if the first RGC iteration takes more than 50% of the maintenance duration, a second
   iteration will never be done. This means the RGC will only clean up one maintenance 
   period. If it is behind (A.2,B), it will therefore never catch up.

It seems therefore desirable to take smaller intervals than the maintenance period. How
small depends on the cost of scanning the database for all nodes matching the RGC
criteria. E.g. what we should measure is:

TODO:
 S1: Scan a large database for matching nodes in interval MI
 S2: Scan for intervals MI/2, 2 times
 S5: MI/5, 5 times
 S10: MI/10, 10 times

which should give some numbers to see what divisor is a good choice.

TODO: collecting the nodes happens with iteration over the result sets, so RGC can
detect that time intervals need to be adjusted. Intervals should 
 * shrink when "too many" candidates are detected in one iteration
 * grow when an iteration uses unexpectedly little time

 















> VersionGarbageCollector should be able to run incrementally
> -----------------------------------------------------------
>
>                 Key: OAK-4780
>                 URL: https://issues.apache.org/jira/browse/OAK-4780
>             Project: Jackrabbit Oak
>          Issue Type: Task
>          Components: core, documentmk
>            Reporter: Julian Reschke
>         Attachments: leafnodes.diff, leafnodes-v2.diff, leafnodes-v3.diff
>
>
> Right now, the documentmk's version garbage collection runs in several phases.
> It first collects the paths of candidate nodes, and only once this has been successfully
finished, starts actually deleting nodes.
> This can be a problem when the regularly scheduled garbage collection is interrupted
during the path collection phase, maybe due to other maintenance tasks. On the next run, the
number of paths to be collected will be even bigger, thus making it even more likely to fail.
> We should think about a change in the logic that would allow the GC to run in chunks;
maybe by partitioning the path space by top level directory.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message