jackrabbit-oak-issues mailing list archives

From "Chetan Mehrotra (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (OAK-6353) Use Document order traversal for reindexing performed on DocumentNodeStore setups
Date Mon, 18 Dec 2017 06:45:00 GMT

    [ https://issues.apache.org/jira/browse/OAK-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294556#comment-16294556
] 

Chetan Mehrotra edited comment on OAK-6353 at 12/18/17 6:44 AM:
----------------------------------------------------------------

With the new document order traversal based indexing, significant performance improvements
were seen.

For a large repo (255M Mongo docs, 66M nodes under /content, including 4.2M assets), indexing
previously completed in 13.66 h. By comparison, document order based indexing completed in
3.469 h.
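
As a quick check on what these two timings imply, the arithmetic can be sketched as below. The variable names are illustrative only, and the "earlier" run is assumed to be the path based traversal mentioned in the issue description.

```python
# Timings quoted in the comment above (hours).
path_traversal_hours = 13.66   # earlier indexing run (assumed path based traversal)
doc_order_hours = 3.469        # document order based indexing run

speedup = path_traversal_hours / doc_order_hours
time_saved_hours = path_traversal_hours - doc_order_hours
reduction_pct = 100 * time_saved_hours / path_traversal_hours

print(f"speedup: {speedup:.2f}x")
print(f"time saved: {time_saved_hours:.2f} h")
print(f"reduction: {reduction_pct:.1f}%")
```

For this one repository that works out to roughly a 3.9x speedup, or about three quarters of the indexing time saved.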

With this, the initially planned implementation is done. Specific issues can be opened later
for further improvements. Possible future enhancements:

# Prefetch the previous documents before doing Mongo traversal - This may reduce the time
to resolve the NodeDocument to NodeState
# Mongo query optimizations
## Avoid fetching nodes under hidden paths at all
## Only fetch those documents from Mongo which are under included paths - This can be done
by using a JavaScript function
# Sorting optimization - Sort the batch in memory as nodes are being read and just write the
sorted files
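
The sorting optimization in the last item can be sketched roughly as follows. This is a minimal illustration, not the Oak implementation: the function names, the one-entry-per-line {{path|json}} layout, and the plain string sort order are all assumptions made for the example, and entries are assumed not to contain newlines.

```python
import heapq
import os
import tempfile
from typing import Iterable, Iterator, List


def write_sorted_runs(entries: Iterable[str], batch_size: int, out_dir: str) -> List[str]:
    """Sort each batch in memory as entries are read and write one sorted file per batch."""
    run_paths: List[str] = []
    batch: List[str] = []

    def flush() -> None:
        if not batch:
            return
        batch.sort()  # in-memory sort of the current batch only
        path = os.path.join(out_dir, f"run-{len(run_paths):05d}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(entry + "\n" for entry in batch)
        run_paths.append(path)
        batch.clear()

    for entry in entries:
        batch.append(entry)
        if len(batch) >= batch_size:
            flush()
    flush()  # write the final, possibly partial, batch
    return run_paths


def merge_runs(run_paths: List[str]) -> Iterator[str]:
    """Lazily merge the already sorted run files into one sorted stream."""
    files = [open(p, encoding="utf-8") for p in run_paths]
    try:
        # Strip newlines before comparing so the order matches the in-memory sort.
        streams = [(line.rstrip("\n") for line in f) for f in files]
        yield from heapq.merge(*streams)
    finally:
        for f in files:
            f.close()


if __name__ == "__main__":
    sample = ["/content/b|{}", "/content/a|{}", "/content|{}"]  # hypothetical entries
    with tempfile.TemporaryDirectory() as tmp:
        for entry in merge_runs(write_sorted_runs(sample, 2, tmp)):
            print(entry)
```

Because every run file is already sorted when it is written, the final step only needs a streaming merge rather than a separate sort pass over one large unsorted file.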

*Usage*

This mode can be enabled for a Mongo based setup via the CLI argument {{--doc-traversal-mode}}.

This indexing mode requires quite a bit of local disk space to store all the NodeStates in
JSON format. For a 200GB Mongo repo, it required 100GB of local disk space to hold the
NodeState JSON and to perform the external sort on it.
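
A rough pre-flight check based on that single data point might look like the sketch below. The 0.5 ratio is only what was observed for this one repo and may not hold for others; the function names and the idea of checking a working directory are assumptions for illustration, not part of oak-run.

```python
import shutil

# Assumption: ratio derived from the one observation above
# (200GB Mongo repo -> 100GB of local disk). Other repos may differ.
OBSERVED_DISK_RATIO = 100 / 200


def estimated_scratch_gb(repo_size_gb: float, ratio: float = OBSERVED_DISK_RATIO) -> float:
    """Estimate local disk needed for the NodeState JSON plus the external sort."""
    return repo_size_gb * ratio


def has_enough_space(work_dir: str, repo_size_gb: float) -> bool:
    """Compare free space on the volume holding work_dir with the estimate."""
    free_gb = shutil.disk_usage(work_dir).free / 1024**3
    return free_gb >= estimated_scratch_gb(repo_size_gb)


if __name__ == "__main__":
    print(estimated_scratch_gb(200))
    print(has_enough_space(".", 200))
```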

Also, the documentation needs to be updated.


> Use Document order traversal for reindexing performed on DocumentNodeStore setups
> ---------------------------------------------------------------------------------
>
>                 Key: OAK-6353
>                 URL: https://issues.apache.org/jira/browse/OAK-6353
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: run
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.7.13, 1.8
>
>         Attachments: OAK-6353-v1.patch, OAK-6353-v2.patch
>
>
> [~tmueller] suggested [here|https://issues.apache.org/jira/browse/OAK-6246?focusedCommentId=16034442&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16034442]
> that document order traversal can be faster than the current mode of path based traversal.
> Initial tests indicate that such a traversal can be an order of magnitude faster.
> So this task is meant to implement such an approach and see if it can be a viable indexing
> mode for DocumentNodeStore based setups.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
