jackrabbit-oak-issues mailing list archives

From "Chetan Mehrotra (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (OAK-4412) Lucene hybrid index
Date Thu, 04 Aug 2016 05:03:20 GMT

    [ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15405340#comment-15405340 ]

Chetan Mehrotra edited comment on OAK-4412 at 8/4/16 5:02 AM:
--------------------------------------------------------------

h3. Approach B - Lucene editor used both in async and sync mode

In the current approach, which uses an Observer to update the local transient index, most
of the work is done in a single indexer thread, which would be doing:
# Diffing the changed nodestates
# Building Lucene Documents from the changes
# Adding the documents to the index

This might cause the indexer to lag behind the current head again, depending on the amount of
writes happening. Instead, we can change the approach and break the work into 2 parts.

*Step 1 - LuceneIndexEditor used in sync mode*

We can move the work done in #1 and #2 above to LuceneIndexEditor, which would be invoked
synchronously [0] as part of the normal commit (similar to how the current property index
editors are invoked). This editor would be backed by a different {{LuceneIndexWriter}} impl
which would add the Documents to the CommitInfo associated with the current commit [1] instead
of adding them directly to the index.

This would thus parallelize the expensive task of diffing and constructing the Lucene Documents,
separating it from the actual indexing (which is single threaded by design).
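
A minimal sketch of this split, using only the JDK. {{CommitContext}} is a hypothetical stand-in for CommitInfo, and {{DocumentSink}}, {{DirectSink}}, {{CommitAttachingSink}} and the {{pendingDocs}} key are illustrative names, not existing Oak or Lucene classes. A document is modeled as a plain map of field names to values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the per-commit metadata holder (CommitInfo in the text).
class CommitContext {
    final Map<String, Object> info = new HashMap<>();
}

// Hypothetical writer abstraction the editor talks to; not a real Oak interface.
interface DocumentSink {
    void add(String indexPath, Map<String, String> doc);
}

// Async mode: documents go straight into the (simulated) index.
class DirectSink implements DocumentSink {
    final Map<String, List<Map<String, String>>> index = new HashMap<>();

    @Override
    public void add(String indexPath, Map<String, String> doc) {
        index.computeIfAbsent(indexPath, k -> new ArrayList<>()).add(doc);
    }
}

// Sync mode: documents are only attached to the commit; nothing is indexed here.
class CommitAttachingSink implements DocumentSink {
    static final String KEY = "pendingDocs"; // assumed attribute name

    private final CommitContext commit;

    CommitAttachingSink(CommitContext commit) {
        this.commit = commit;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void add(String indexPath, Map<String, String> doc) {
        // Group the prepared documents by the index they belong to.
        Map<String, List<Map<String, String>>> pending =
                (Map<String, List<Map<String, String>>>) commit.info.computeIfAbsent(
                        KEY, k -> new HashMap<String, List<Map<String, String>>>());
        pending.computeIfAbsent(indexPath, k -> new ArrayList<>()).add(doc);
    }
}
```

The editor code stays the same in both modes; only the sink it is given differs, which is the point of backing it with a different writer impl.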

*Step 2 - Async local indexer*

To complement the editor there would be an observer which listens for changes. This observer:

* For local changes, would extract the Documents (prepared and added to the CommitInfo associated
with the change, per the previous step) and add them to the queue for the writer of the matching index
* For external changes, would run the editor, do the diff, prepare the documents and
add them to the queue for the respective writer.

Note that the work here can be done on a best effort basis. If it takes too long, the indexer can
"drop" documents or, say, skip indexing altogether for external diffs. Those aspects can
be exposed for tuning.
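
The observer and its best effort queue could be sketched as follows, using only the JDK. All class names, the {{indexExternalChanges}} switch and the {{externalDiff}} function are assumptions made for illustration. Documents are modeled as plain strings, and a {{null}} document list stands for an external change that carries no prepared documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Hypothetical bounded queue feeding the single-threaded local index writer.
class BestEffortDocQueue {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    BestEffortDocQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Never blocks the caller; returns false when the document was dropped.
    boolean offer(String doc) {
        boolean accepted = queue.offer(doc);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    // Called by the writer thread: take everything queued so far.
    List<String> drain() {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch);
        return batch;
    }

    long droppedCount() {
        return dropped.get();
    }
}

// Hypothetical observer: local changes reuse prepared docs, external ones are diffed.
class LocalIndexObserver {
    private final BestEffortDocQueue queue;
    private final boolean indexExternalChanges;                 // assumed tuning switch
    private final Function<String, List<String>> externalDiff;  // stand-in for running the editor

    LocalIndexObserver(BestEffortDocQueue queue, boolean indexExternalChanges,
                       Function<String, List<String>> externalDiff) {
        this.queue = queue;
        this.indexExternalChanges = indexExternalChanges;
        this.externalDiff = externalDiff;
    }

    // docsFromCommit is null for external changes (no prepared documents attached).
    void contentChanged(String revision, List<String> docsFromCommit) {
        List<String> docs;
        if (docsFromCommit != null) {
            docs = docsFromCommit;
        } else if (indexExternalChanges) {
            docs = externalDiff.apply(revision);
        } else {
            return; // external indexing switched off
        }
        for (String doc : docs) {
            queue.offer(doc);
        }
    }
}
```

Using a non-blocking {{offer}} means a slow writer only costs freshness of the transient index and never blocks the observer thread.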

*Step 3 - NRT Reader on Query Side*

On the query side we would construct the reader from the existing writer itself, utilizing the [Lucene
NRT support|https://wiki.apache.org/lucene-java/NearRealtimeSearch]. With the updated support
for MultiReader (done as part of OAK-4566), the query logic would also consider this reader
for any query evaluation. This would ensure that queries get to see the most recent results.

We can utilize all aspects of NRT (like skipping deletes, as the query engine would filter
out false results).
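
To illustrate why skipped deletes are tolerable, here is a small JDK-only sketch of the query side. It does not use the Lucene reader API. {{HybridQuery}} and its parameters are hypothetical, index hits are modeled as lists of paths, and the query engine's constraint check is reduced to comparing one property value against the current repository state.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the query side: union of both indexes, then re-check.
class HybridQuery {

    // persistedHits: paths from the async (persisted) index
    // nrtHits:       paths from the local transient index
    // currentState:  path -> current property value in the repository
    // wanted:        value required by the query constraint
    static Set<String> query(List<String> persistedHits, List<String> nrtHits,
                             Map<String, String> currentState, String wanted) {
        // Union of both sources; a path present in both is reported once.
        Set<String> candidates = new LinkedHashSet<>(persistedHits);
        candidates.addAll(nrtHits);

        // Stale entries (deleted or changed nodes) fail the re-check and are dropped.
        Set<String> result = new LinkedHashSet<>();
        for (String path : candidates) {
            if (wanted.equals(currentState.get(path))) {
                result.add(path);
            }
        }
        return result;
    }
}
```

A hit for a node that has since been deleted or modified is simply filtered by the re-check, so the transient index does not have to process deletes eagerly.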

*Step 4 - Pruning of transient index*

Further, we would need to periodically prune the transient indexes. This can be done by deleting
those documents which are older than the last async index update cycle. With each async index
update we can say that the repository is indexed up to the time when that async index update was started.
So we can use that time and remove those documents from the index which are older than 2 cycles.
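
One possible reading of the "older than 2 cycles" rule, sketched with JDK classes only: when an async cycle completes, documents indexed before the start of the previous cycle are removed. {{TransientIndexPruner}} and its methods are hypothetical names, and timestamps are plain longs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a transient index with time based pruning.
class TransientIndexPruner {

    private static final class Entry {
        final String path;
        final long indexedAt;

        Entry(String path, long indexedAt) {
            this.path = path;
            this.indexedAt = indexedAt;
        }
    }

    private final List<Entry> docs = new ArrayList<>();
    private long previousCycleStart = Long.MIN_VALUE; // start of the cycle before the last one
    private long lastCycleStart = Long.MIN_VALUE;     // start of the last completed cycle

    void add(String path, long indexedAt) {
        docs.add(new Entry(path, indexedAt));
    }

    // Invoked when the async cycle that started at cycleStart has completed.
    // Returns the number of pruned documents.
    int onAsyncCycleCompleted(long cycleStart) {
        previousCycleStart = lastCycleStart;
        lastCycleStart = cycleStart;
        // Anything older than the start of the previous cycle is covered by the
        // persisted index for at least two cycles and can go.
        final long cutoff = previousCycleStart;
        int before = docs.size();
        docs.removeIf(e -> e.indexedAt < cutoff);
        return before - docs.size();
    }

    int size() {
        return docs.size();
    }
}
```

Keeping one extra cycle of documents is the conservative choice here: a reader still working against the older persisted index does not lose results.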

*Points to consider*

# Text extraction would be disabled for such transient indexing
# All this would be done on a best effort basis. Note that even if the index has some stale data,
the QE would still evaluate and enforce the query constraints [2] and would filter out wrong
results.
# Each such transient index would be backed by an FSDirectory. The FSDirectory would be cleaned
upon restart
# The editors need to ignore reindex calls etc

[0] This would require a change in the current indexing logic, where a given index definition can
only be used in either sync or async mode but not in both (OAK-4641)
[1] CommitInfo is currently not accessible to index editors, so this would need to be changed
(OAK-4641, OAK-4640)
[2] Fulltext constraints would not be evaluated though. Do note that the primary focus for such
a hybrid index is property indexes


> Lucene hybrid index
> -------------------
>
>                 Key: OAK-4412
>                 URL: https://issues.apache.org/jira/browse/OAK-4412
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: lucene
>            Reporter: Tomek Rękawek
>            Assignee: Chetan Mehrotra
>             Fix For: 1.6
>
>         Attachments: OAK-4412.patch
>
>
> When running Oak in a cluster, each write operation is expensive. After performing some
> stress-tests with a geo-distributed Mongo cluster, we've found out that updating property
> indexes is a large part of the overall traffic.
> The asynchronous index would be an answer here (as the index update won't be made in
> the client request thread), but AEM requires the updates to be visible immediately in
> order to work properly.
> The idea here is to enhance the existing asynchronous Lucene index with a synchronous,
> locally-stored counterpart that will persist only the data since the last Lucene background
> reindexing job.
> The new index can be stored in memory or (if necessary) in MMAPed local files. Once the
> "main" Lucene index is updated, the local index will be purged.
> Queries will use a union of results from the {{lucene}} and {{lucene-memory}} indexes.
> The {{lucene-memory}} index, as a locally stored entity, will be updated using an observer,
> so it'll get both local and remote changes.
> The original idea has been suggested by [~chetanm] in the discussion for OAK-4233.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
