lucene-dev mailing list archives

From "Renaud Delbru (JIRA)" <>
Subject [jira] [Commented] (SOLR-6460) Keep transaction logs around longer
Date Wed, 24 Sep 2014 10:25:33 GMT


Renaud Delbru commented on SOLR-6460:


Here is an initial analysis and proposal for the modifications of the UpdateLog for CDCR.
Most of the original workflow of the UpdateLog can be left untouched. However, it is necessary
to keep the concept of "maximum number of records to keep" (except for the cleaning of old
transaction logs) so as not to interfere with the normal workflow.

h4. Cleaning of Old Transaction Logs

The logic to remove old tlog files should be modified so that it relies on pointers instead
of a limit defined by the maximum number of records to keep.
The UpdateLog should be in charge of keeping the list of pointers and of managing
their life-cycle (or of delegating this to the LogReader, which is presented next). Such a pointer,
denoted LogPointer, should be composed of a tlog file and an associated file pointer.
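
To make the pointer-based cleanup concrete, here is a minimal sketch. It assumes that tlog files carry increasing numeric ids; `LogPointer` follows the naming above, while `PointerRegistry` and its methods are hypothetical and not existing Solr classes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// A position inside one tlog file (hypothetical shape).
final class LogPointer {
    final long tlogId;      // assumption: tlog files carry increasing ids
    final long filePointer; // byte offset inside that tlog file

    LogPointer(long tlogId, long filePointer) {
        this.tlogId = tlogId;
        this.filePointer = filePointer;
    }
}

// Hypothetical registry owned by the UpdateLog; pointers are tracked by identity.
final class PointerRegistry {
    private final Set<LogPointer> active = new HashSet<>();

    synchronized LogPointer acquire(long tlogId, long filePointer) {
        LogPointer p = new LogPointer(tlogId, filePointer);
        active.add(p);
        return p;
    }

    synchronized void release(LogPointer p) {
        active.remove(p);
    }

    // Returns the ids of the tlog files that are older than every active pointer.
    // With no active pointer, every file is reported as removable.
    synchronized List<Long> removable(List<Long> tlogIds) {
        long oldestNeeded = Long.MAX_VALUE;
        for (LogPointer p : active) {
            oldestNeeded = Math.min(oldestNeeded, p.tlogId);
        }
        List<Long> result = new ArrayList<>();
        for (long id : tlogIds) {
            if (id < oldestNeeded) {
                result.add(id);
            }
        }
        return result;
    }
}
```

In this sketch a tlog file becomes eligible for removal only once no pointer references it or an older file, which replaces the record-count limit for cleanup purposes.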

h4. Log Reader

The UpdateLog must provide a log reader, denoted LogReader, that will be used by the CDC Replicator
to search, scan and read the update logs. The LogReader will wrap a LogPointer and hide its
management (e.g., instantiation, increment, release).

The operations that must be provided by the LogReader are:
* Scan: move LogPointer to next entry
* Read: read a log entry specified by the LogPointer
* Lookup: look up a version number - this will be performed during the initialisation of the
CDC Replicator / election of a new leader, and therefore rarely.

The LogReader must read not only old tlog files, but also the new tlog file (i.e., the transaction
log being written). This requires specific logic, since a LogReader can be exhausted at a
time t1 and have new entries available at a time t2.
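
The three operations, and the "exhausted at t1, new entries at t2" behaviour, can be sketched as follows. This is only an illustration under simplifying assumptions: `InMemoryLog` is a hypothetical in-memory stand-in for the chain of tlog files, an integer position plays the role of the wrapped LogPointer, and `SketchLogReader` is not an existing Solr class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for the tlog files, including the one being written.
final class InMemoryLog {
    private final List<long[]> entries = new ArrayList<>(); // each entry: {version, payload}

    synchronized void append(long version, long payload) {
        entries.add(new long[] {version, payload});
    }

    synchronized int size() {
        return entries.size();
    }

    synchronized long[] get(int i) {
        return entries.get(i);
    }
}

final class SketchLogReader {
    private final InMemoryLog log;
    private int position = -1; // plays the role of the wrapped LogPointer

    SketchLogReader(InMemoryLog log) {
        this.log = log;
    }

    // Scan: advance to the next entry. false means "exhausted for now";
    // a later call can succeed once the writer has appended more entries.
    boolean scan() {
        if (position + 1 >= log.size()) {
            return false;
        }
        position++;
        return true;
    }

    // Read: return the entry currently designated by the pointer.
    long[] read() {
        if (position < 0) {
            throw new IllegalStateException("call scan() or lookup() first");
        }
        return log.get(position);
    }

    // Lookup: move the pointer to the entry with the given version, if present.
    // Linear scan here; the Log Index section discusses how to avoid it.
    boolean lookup(long version) {
        for (int i = 0; i < log.size(); i++) {
            if (log.get(i)[0] == version) {
                position = i;
                return true;
            }
        }
        return false;
    }
}
```

The key design point is that `scan()` reports exhaustion without invalidating the reader, so the same reader can resume when new entries arrive.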

h4. Log Index

In order to support efficient lookup of version numbers across a large number of tlog files,
we need a pre-computed index of version numbers across tlog files.
The index could be designed as a list of tlog files, each associated with its lower and upper
bounds in terms of version numbers. The search will then read this index to quickly find the
tlog files containing a given version number, then read the tlog file to find the associated
entry.

However, a single tlog file can be large in certain scenarios. Therefore, we could add a
secondary index per tlog file. This index will contain a list of <version, pointer>
pairs. This will allow the LogReader to quickly find an entry without having to scan the full
tlog file. This index will be created and managed by the TransactionLog.
This secondary index however duplicates the version number for each log entry. A possible
optimisation is to modify the format of the transaction log so that the version number is
not stored as part of the log entry.
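
A minimal sketch of the two-level lookup described in this section follows. All class names (`TlogIndexEntry`, `LogIndex`, `Location`) and the file names used with them are hypothetical; the sketch keeps both levels in memory and assumes each tlog file contains at least one entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Result of a lookup: which tlog file, and where in it.
final class Location {
    final String fileName;
    final long offset;

    Location(String fileName, long offset) {
        this.fileName = fileName;
        this.offset = offset;
    }
}

// One tlog file with its version bounds and its secondary <version, pointer> index.
final class TlogIndexEntry {
    final String fileName;
    final long minVersion;
    final long maxVersion;
    final TreeMap<Long, Long> offsets; // version -> byte offset in the file

    TlogIndexEntry(String fileName, TreeMap<Long, Long> offsets) {
        // Assumption: offsets is non-empty, otherwise firstKey() throws.
        this.fileName = fileName;
        this.offsets = offsets;
        this.minVersion = offsets.firstKey();
        this.maxVersion = offsets.lastKey();
    }
}

final class LogIndex {
    private final List<TlogIndexEntry> tlogs = new ArrayList<>();

    void add(TlogIndexEntry entry) {
        tlogs.add(entry);
    }

    // First level: skip files whose [min, max] range cannot contain the version.
    // Second level: consult the per-file index instead of scanning the file.
    Location lookup(long version) {
        for (TlogIndexEntry e : tlogs) {
            if (version >= e.minVersion && version <= e.maxVersion) {
                Long offset = e.offsets.get(version);
                if (offset != null) {
                    return new Location(e.fileName, offset);
                }
            }
        }
        return null; // version not present in any indexed tlog file
    }
}
```

The range check tolerates overlapping version ranges between files; if the ranges were known to be disjoint and sorted, the first level could use a binary search instead of a linear pass.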

h4. Transaction Log

The TransactionLog class opens the tlog file in its constructor. This could be problematic
with a large number of tlog files, as it will exhaust the file descriptors. One possible
solution is to create a subclass for read-only mode that will not open the file in the constructor.
Instead, the file will be opened and closed on demand by using the TransactionLog#LogReader.
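
The on-demand idea can be illustrated with plain JDK I/O. This is a sketch only: `LazyReadOnlyLog` is a hypothetical standalone class, not a subclass of Solr's TransactionLog, and it opens and closes the file around every read, which is the simplest possible policy.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

// Hypothetical stand-in for a read-only transaction log.
final class LazyReadOnlyLog {
    private final Path file;

    // The constructor only remembers the path; no file descriptor is held.
    LazyReadOnlyLog(Path file) {
        this.file = file;
    }

    // Opens the file, reads `length` bytes starting at `offset`, then closes it.
    byte[] readAt(long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] buffer = new byte[length];
            raf.seek(offset);
            raf.readFully(buffer); // throws EOFException if the file is too short
            return buffer;
        }
    }
}
```

Opening per read keeps the descriptor count at zero between reads but adds overhead; a reader that scans many consecutive entries would more plausibly hold the file open for the duration of the scan and close it afterwards.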

The CDCR Update Logs will take care of converting old transaction log objects to read-only
mode. This has, however, indirect consequences for the initialisation of the UpdateLog, more precisely
in the recovery phase (#recoverFromLog), as the UpdateLog might write a commit (line 1418)
at the end of an old tlog during replay.

h4. Integration within the UpdateHandler

We will have to extend the UpdateHandler constructor in order to have the possibility to switch
the UpdateLog implementation based on some configuration keys in the solrconfig.xml file.
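
One possible shape for such a switch is sketched below. Everything here is hypothetical: the `updateLog.impl` key, the `SketchUpdateLog` interface and the factory are illustrations only and do not reflect the actual UpdateHandler code or the solrconfig.xml schema; the map stands in for values parsed from the configuration file.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-ins for the default and the CDCR-aware update log.
interface SketchUpdateLog {
    String name();
}

final class DefaultSketchUpdateLog implements SketchUpdateLog {
    public String name() {
        return "default";
    }
}

final class CdcrSketchUpdateLog implements SketchUpdateLog {
    public String name() {
        return "cdcr";
    }
}

final class SketchUpdateLogFactory {
    private static final Map<String, Supplier<SketchUpdateLog>> IMPLS = new HashMap<>();

    static {
        IMPLS.put("default", DefaultSketchUpdateLog::new);
        IMPLS.put("cdcr", CdcrSketchUpdateLog::new);
    }

    // `config` stands in for keys read from solrconfig.xml (hypothetical key name).
    static SketchUpdateLog create(Map<String, String> config) {
        String key = config.getOrDefault("updateLog.impl", "default");
        Supplier<SketchUpdateLog> supplier = IMPLS.get(key);
        if (supplier == null) {
            throw new IllegalArgumentException("unknown update log implementation: " + key);
        }
        return supplier.get();
    }
}
```

Falling back to the default implementation when the key is absent keeps existing configurations working unchanged.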

> Keep transaction logs around longer
> -----------------------------------
>                 Key: SOLR-6460
>                 URL:
>             Project: Solr
>          Issue Type: Sub-task
>            Reporter: Yonik Seeley
> Transaction logs are currently deleted relatively quickly... but we need to keep them
> around much longer to be used as a source for cross-datacenter recovery.  This will also be
> useful in the future for enabling peer-sync to use more historical updates before falling
> back to replication.
