jackrabbit-oak-issues mailing list archives

From "Chetan Mehrotra (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (OAK-4412) Lucene hybrid index
Date Thu, 15 Sep 2016 06:43:20 GMT

    [ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15467219#comment-15467219
] 

Chetan Mehrotra edited comment on OAK-4412 at 9/15/16 6:42 AM:
---------------------------------------------------------------

Planned feature work is now done and [patch|^OAK-4412-v1.diff] is ready for review.

h3. A - Purpose

Hybrid index provides 2 indexing modes

h4. nrt
In this mode, Lucene documents would be created for each commit as part of the sync commit and
then added asynchronously to a *local* index, whose IndexReader would be refreshed
with a _refresh interval_ of 1 sec

h5. Benefit - Reduced delay between content change and it showing up in query result
In this mode the primary aim is to reduce the time between a write to the content and that
write being reflected in query results. With current async indexing the latency can range
from 5 sec to 1 minute, depending on cluster load and how fast async indexing runs. With
nrt mode, even if the async indexer does not catch up quickly, the local index would pick up
the changes and hence recent additions would be reflected in query results.

h5. Benefit - Reduced storage in Mongo/RDB case

The indexes would be stored in Lucene and hence consume much less space than a property
index. Query execution would be a lot faster as the indexes are copied locally, so query
evaluation would involve far fewer remote calls.

h5. Benefit - Better handling for index update/rebuild

Currently, if an async index needs to be rebuilt (say due to some corruption) or needs to be
updated to index more content, the whole async indexing process is blocked until reindexing of
that index completes. Query results become a lot more stale as a result, e.g. if reindexing
takes 2-3 hrs then the async index results would lag behind the repository state by that much time.

With an nrt index this would improve: even if async indexing is blocked, the local index would
still get updates and query results would not go stale.

So performing such an index update on a live system becomes easier.

h4. sync
In this mode the Lucene document would be added to the index and the IndexReader would be
*immediately* refreshed. Functionally this would be similar to a property index.

This mode should be used for those cases where code expects changes made in a session to be
immediately reflected in queries. So if a session sets _/a/b/@foo_ to _bar_, and just after
session save performs a query for 'bar' and expects /a/b/@foo to be part of the result set,
then this mode should be used.

Performance wise this mode is slower than {{nrt}} and slows down writes.

The indexes created under hybrid index are local and only hold index data from the last async
index cycle up to the most recent commit. Any search would be performed via a MultiReader that
combines a reader from the local index with one from the index built by async indexing.


h3. B - Usage

To enable this mode for any index, make the {{async}} property a multi-valued property with
one of the following sets of values

* {{async}} = [{{async}}, {{nrt}}] - Enables the NRT mode
* {{async}} = [{{async}}, {{sync}}] - Enables the sync mode

{{LuceneIndexProviderService}} - Provides some tuning configuration which can be modified as
per setup requirements


h4. Implementation Detail

Most of the new code lives under the {{org.apache.jackrabbit.oak.plugins.index.lucene.hybrid}}
package. For any commit involving an index definition marked with {{nrt}} or {{sync}}, {{LuceneIndexEditorProvider}}
would return a {{LuceneIndexEditor}} backed by {{LocalIndexWriterFactory}}. This factory would
use {{LocalIndexWriter}}, which stores the prepared {{LuceneDoc}} in {{LuceneDocumentHolder}}.
This holder instance is stored as part of {{CommitContext}} (which is stored in the {{CommitInfo}}
associated with the commit).

Once the merge is done for that commit, the change is picked up by {{LocalIndexObserver}} (a sync
observer). This observer would then look for a {{LuceneDocumentHolder}} and, if found, would
process the {{LuceneDoc}} instances stored in it

* For documents belonging to {{nrt}} mode it would add the docs to {{DocumentQueue}}
* For documents belonging to {{sync}} mode it would directly write the document to the {{NRTIndex}}
configured for that index

{{DocumentQueue}} asynchronously picks up the docs from the queue and then writes them to the
index. While adding a doc to the queue the caller can block for a short time, and if the queue
remains full the doc would be _dropped and not added to queue_. So indexing here is on a best effort basis
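
The blocking-then-dropping behaviour described above can be illustrated with a bounded queue from the JDK. This is only a minimal sketch: {{BestEffortDocQueue}}, its method names and the use of plain strings as documents are hypothetical stand-ins and not the actual {{DocumentQueue}} code from the patch.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a best effort document queue; not the Oak class. */
class BestEffortDocQueue {
    private final LinkedBlockingQueue<String> queue;
    private final long offerTimeoutMillis;
    private final AtomicInteger dropped = new AtomicInteger();

    BestEffortDocQueue(int capacity, long offerTimeoutMillis) {
        this.queue = new LinkedBlockingQueue<>(capacity);
        this.offerTimeoutMillis = offerTimeoutMillis;
    }

    /** Waits at most offerTimeoutMillis for space; returns false if the doc was dropped. */
    boolean add(String doc) {
        boolean added;
        try {
            added = queue.offer(doc, offerTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            added = false;
        }
        if (!added) {
            dropped.incrementAndGet();
        }
        return added;
    }

    /** Used by the background writer; returns null when nothing is queued. */
    String poll() {
        return queue.poll();
    }

    int droppedCount() {
        return dropped.get();
    }
}

class BestEffortDocQueueDemo {
    public static void main(String[] args) {
        BestEffortDocQueue q = new BestEffortDocQueue(2, 50);
        q.add("/a/b");
        q.add("/a/c");
        // The queue is full and nobody is draining it, so this doc is dropped.
        boolean accepted = q.add("/a/d");
        System.out.println("accepted=" + accepted + ", dropped=" + q.droppedCount());
    }
}
```

A dropped document is not lost for good in this design, since the regular async indexing cycle would still index the change later.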

*NRTIndex*
On the indexing side each index (represented by {{IndexNode}}) has a matching {{NRTIndex}} which
is constructed by {{NRTIndexFactory}}. Whenever a new {{IndexNode}} instance is created
as a result of a change in the async index (via {{IndexTracker}}), the factory would create a new
{{NRTIndex}} for it. It keeps a maximum of 2 {{NRTIndex}} instances and closes and garbage
collects older ones. So an {{NRTIndex}} would only have index data for content indexed between
2 consecutive async indexing cycles.
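
The "keep at most 2 instances" rule can be sketched as below. {{SketchNrtIndex}} and {{SketchNrtIndexFactory}} are hypothetical names chosen for illustration; the real {{NRTIndexFactory}} also manages Lucene directories and writers, which are left out here.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical placeholder for one local index generation. */
class SketchNrtIndex {
    final int generation;
    private boolean closed;

    SketchNrtIndex(int generation) {
        this.generation = generation;
    }

    void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}

/** Hypothetical factory that retains only the two most recent generations. */
class SketchNrtIndexFactory {
    private static final int MAX_OPEN = 2;
    private final Deque<SketchNrtIndex> open = new ArrayDeque<>();
    private int nextGeneration;

    /** Assumed to be called each time the async index changes. */
    synchronized SketchNrtIndex createIndex() {
        SketchNrtIndex created = new SketchNrtIndex(nextGeneration++);
        open.addLast(created);
        // Close everything older than the two most recent generations.
        while (open.size() > MAX_OPEN) {
            open.removeFirst().close();
        }
        return created;
    }

    synchronized int openCount() {
        return open.size();
    }
}

class SketchNrtIndexFactoryDemo {
    public static void main(String[] args) {
        SketchNrtIndexFactory factory = new SketchNrtIndexFactory();
        SketchNrtIndex first = factory.createIndex();
        factory.createIndex();
        factory.createIndex();
        System.out.println("first closed=" + first.isClosed() + ", open=" + factory.openCount());
    }
}
```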

{{NRTIndex}} provides access to the {{IndexWriter}} which is used by {{DocumentQueue}} to write
documents to it. It also creates the {{IndexReader}}, which is obtained from the {{IndexWriter}} making
use of [Lucene NRT Support|http://wiki.apache.org/lucene-java/NearRealtimeSearch]

{{NRTIndex}} also provides access to a {{ReaderRefreshPolicy}} which determines how and when
the reader should be refreshed. The policy instance is also made aware of the changes done
to the index. For {{nrt}} indexes {{TimedRefreshPolicy}} is used, which by default refreshes the
reader after a 1 sec delay. For {{sync}} indexes {{RefreshOnWritePolicy}} is used, which refreshes
the reader after any write
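
The difference between the two policies can be sketched as follows. The interface and class names carry a {{Sketch}} prefix because they are illustrative assumptions and do not mirror the signatures in the patch; the clock is injected so the timing logic can be exercised without sleeping.

```java
import java.util.function.LongSupplier;

/** Hypothetical policy contract; the method names are assumptions. */
interface SketchRefreshPolicy {
    /** Notifies the policy that documents were written to the index. */
    void updated();

    /** Runs the callback if the reader should be reopened now; returns whether it ran. */
    boolean refreshIfNeeded(Runnable refreshCallback);
}

/** Refreshes only when there are pending writes and the delay has elapsed. */
class SketchTimedRefreshPolicy implements SketchRefreshPolicy {
    private final LongSupplier clockMillis;
    private final long refreshDeltaMillis;
    private boolean dirty;
    private long lastRefresh;

    SketchTimedRefreshPolicy(LongSupplier clockMillis, long refreshDeltaMillis) {
        this.clockMillis = clockMillis;
        this.refreshDeltaMillis = refreshDeltaMillis;
        this.lastRefresh = clockMillis.getAsLong();
    }

    @Override
    public synchronized void updated() {
        dirty = true;
    }

    @Override
    public synchronized boolean refreshIfNeeded(Runnable refreshCallback) {
        long now = clockMillis.getAsLong();
        if (dirty && now - lastRefresh >= refreshDeltaMillis) {
            dirty = false;
            lastRefresh = now;
            refreshCallback.run();
            return true;
        }
        return false;
    }
}

/** Refreshes as soon as any write has been reported. */
class SketchRefreshOnWritePolicy implements SketchRefreshPolicy {
    private boolean dirty;

    @Override
    public synchronized void updated() {
        dirty = true;
    }

    @Override
    public synchronized boolean refreshIfNeeded(Runnable refreshCallback) {
        if (!dirty) {
            return false;
        }
        dirty = false;
        refreshCallback.run();
        return true;
    }
}

class SketchRefreshPolicyDemo {
    public static void main(String[] args) {
        SketchRefreshPolicy policy = new SketchRefreshOnWritePolicy();
        policy.updated();
        policy.refreshIfNeeded(() -> System.out.println("reader reopened"));
    }
}
```

With the timed variant a burst of writes within one second leads to a single reader reopen, which is why {{nrt}} is cheaper than {{sync}}.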

*Avoiding Deletes*

The indexing logic avoids deleting any document in the Lucene index. So if /a/b/@foo is updated,
say, 3 times between 2 async index cycles

* /a/b/@foo = 'x'
* /a/b/@foo = 'y'
* /a/b/@foo = 'z'

Then the Lucene index would have 3 documents added (none updated in place). {{LucenePropertyIndex}}
would then match any of the 3 depending on query criteria. Say the query is for foo='x': {{LucenePropertyIndex}}
would return /a/b as part of the Cursor. The cursor used is a unique cursor, so if Lucene returns
three documents for the same path then only the first one would result in a cursor entry and the
others would be ignored

Later the query engine (QE) would evaluate /a/b against the query criteria as per the {{ContentSession}}
revision, and if the node value at that time matches, the result would be returned to the end user;
otherwise it would be skipped. So if per the current root NodeState /a/b/@foo='x' and, for a query
on foo='y', LucenePropertyIndex returns /a/b, then QE would filter out that result

So in no case correctness of the result would get affected. This allows us to avoid deleting
documents in Lucene index.
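
The reasoning above can be checked with a small simulation. This is a sketch under simplifying assumptions: {{AppendOnlyIndexSketch}} is a hypothetical class, the "index" is just an append-only list, and the query engine check is reduced to comparing against the current property value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of an append-only index plus query engine filtering. */
class AppendOnlyIndexSketch {
    private static final class Doc {
        final String path;
        final String foo;

        Doc(String path, String foo) {
            this.path = path;
            this.foo = foo;
        }
    }

    // Documents are only ever appended, never updated or deleted.
    private final List<Doc> docs = new ArrayList<>();
    // Stand-in for the current repository state (path -> value of foo).
    private final Map<String, String> currentFoo = new HashMap<>();

    void setFoo(String path, String value) {
        currentFoo.put(path, value);
        docs.add(new Doc(path, value));
    }

    /** Index side: unique paths of all documents matching the value (may be stale). */
    Set<String> indexCandidates(String value) {
        Set<String> paths = new LinkedHashSet<>();
        for (Doc doc : docs) {
            if (doc.foo.equals(value)) {
                paths.add(doc.path);
            }
        }
        return paths;
    }

    /** Query engine side: keep only candidates whose current value still matches. */
    List<String> query(String value) {
        List<String> result = new ArrayList<>();
        for (String path : indexCandidates(value)) {
            if (value.equals(currentFoo.get(path))) {
                result.add(path);
            }
        }
        return result;
    }

    int docCount() {
        return docs.size();
    }
}

class AppendOnlyIndexSketchDemo {
    public static void main(String[] args) {
        AppendOnlyIndexSketch index = new AppendOnlyIndexSketch();
        index.setFoo("/a/b", "x");
        index.setFoo("/a/b", "y");
        index.setFoo("/a/b", "z");
        System.out.println("stale candidates for x: " + index.indexCandidates("x"));
        System.out.println("final result for x: " + index.query("x"));
        System.out.println("final result for z: " + index.query("z"));
    }
}
```

The stale documents cost some extra candidate evaluation, but they cannot produce a wrong result because the final check always runs against the current state.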

h3. C - Benchmark

A benchmark has been implemented in oak-run under {{HybridIndexTest}}. It creates multiple
indexes (_numOfIndexes_ = 10) to simulate a system having multiple indexes defined and then
creates nodes with property {{foo}} set to a value from the enum _Status_. Each thread
then creates nodes in breadth first fashion (defaults to 5 child nodes per node, and then the
same for each child node).

In addition there is a {{Searcher}} thread which queries for different values and a {{Mutator}}
which modifies the values. The benchmark supports the following settings

* refreshDeltaMillis - 1000 - Time delay in milliseconds between reader reopens for nrt
* asyncInterval - 5 - Time in seconds for async indexer
* queueSize - 1000 - Size of queue used by {{DocumentQueue}}
* hybridIndexEnabled - Boolean flag. If set to true hybrid index would be used, otherwise property
index would be used
* indexingMode - Defaults to nrt - [nrt/sync] - Which mode to use if hybridIndexEnabled
* useOakCodec - Boolean flag. If set to true {{oakCodec}} would be used to avoid compression,
which slows down the searches (OAK-1737)

{noformat}
java  -DhybridIndexEnabled=true -DindexingMode=nrt -jar oak-run*.jar benchmark --concurrency=5
HybridIndexTest Oak-Mongo-FDS Oak-Segment-Tar-FDS
{noformat}

_Results would be posted soon_

h3. D - Pending Feature Work

* Support for listening to external changes and then update the {{nrt}} indexes based on those
changes - Tracked via OAK-4808
* JMX MBean around NRTIndexFactory to see rate of change etc - OAK-4809




> Lucene hybrid index
> -------------------
>
>                 Key: OAK-4412
>                 URL: https://issues.apache.org/jira/browse/OAK-4412
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: lucene
>            Reporter: Tomek Rękawek
>            Assignee: Chetan Mehrotra
>             Fix For: 1.6
>
>         Attachments: OAK-4412-v1.diff, OAK-4412.patch, hybrid-benchmark.sh, hybrid-result-v1.txt
>
>
> When running Oak in a cluster, each write operation is expensive. After performing some
> stress-tests with a geo-distributed Mongo cluster, we've found out that updating property
> indexes is a large part of the overall traffic.
> The asynchronous index would be an answer here (as the index update won't be made in
> the client request thread), but AEM requires the updates to be visible immediately in
> order to work properly.
> The idea here is to enhance the existing asynchronous Lucene index with a synchronous,
> locally-stored counterpart that will persist only the data since the last Lucene background
> reindexing job.
> The new index can be stored in memory or (if necessary) in MMAPed local files. Once the
> "main" Lucene index is updated, the local index will be purged.
> Queries will use a union of results from the {{lucene}} and {{lucene-memory}} indexes.
> The {{lucene-memory}} index, as a locally stored entity, will be updated using an observer,
> so it'll get both local and remote changes.
> The original idea has been suggested by [~chetanm] in the discussion for OAK-4233.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
