jackrabbit-oak-issues mailing list archives

From "Chetan Mehrotra (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (OAK-2787) Faster multi threaded indexing / text extraction for binary content
Date Mon, 20 Nov 2017 11:23:00 GMT

     [ https://issues.apache.org/jira/browse/OAK-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chetan Mehrotra updated OAK-2787:
    Fix Version/s:     (was: 1.8)

> Faster multi threaded indexing / text extraction for binary content
> -------------------------------------------------------------------
>                 Key: OAK-2787
>                 URL: https://issues.apache.org/jira/browse/OAK-2787
>             Project: Jackrabbit Oak
>          Issue Type: Wish
>          Components: lucene
>            Reporter: Chetan Mehrotra
>             Fix For: 1.10
> With Lucene-based indexing, the indexing process is single-threaded. This hampers the
> indexing of binary content, because on a multi-processor system only a single thread can
> be used to perform the indexing.
> [~ianeboston] suggested a possible approach [1] involving two-phase indexing:
> # In the first phase, detect the nodes to be indexed and start full-text extraction of
>   the binary content. After extraction, save the binary token stream back to the node as
>   hidden data. In this phase the node properties can still be indexed, and a marker field
>   is added to indicate that the full-text index is still pending.
> # Later, in the second phase, look for all such Lucene docs and update them with the
>   saved token stream.
> This would allow the text extraction logic to be decoupled from the Lucene indexing logic.
> [1] http://markmail.org/thread/2w5o4bwqsosb6esu
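
The two-phase approach described in the issue can be illustrated with a minimal, self-contained Java sketch. It is not Oak or Lucene code: the `Doc` class, the `hiddenText` map (standing in for hidden data saved on the node), the `extractText` placeholder, and the thread-pool layout are all assumptions made for illustration. It only shows how extraction could run on several threads while the index update stays a separate step.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TwoPhaseIndexingSketch {

    /** Hypothetical stand-in for an index document (not the Lucene Document class). */
    static final class Doc {
        final String path;
        boolean fulltextPending = true; // marker field set in phase 1
        String fulltext;                // filled in during phase 2

        Doc(String path) {
            this.path = path;
        }
    }

    /** Placeholder for real text extraction, which is the expensive step. */
    static String extractText(byte[] binary) {
        return new String(binary, StandardCharsets.UTF_8).trim();
    }

    static Map<String, Doc> index(Map<String, byte[]> binaries, int threads) throws Exception {
        Map<String, Doc> index = new LinkedHashMap<>();
        // Assumption: this map plays the role of the hidden data saved on each node.
        Map<String, String> hiddenText = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Phase 1: index each node immediately with a pending marker and
            // hand the text extraction to the thread pool.
            List<Future<?>> extractions = new ArrayList<>();
            for (Map.Entry<String, byte[]> e : binaries.entrySet()) {
                index.put(e.getKey(), new Doc(e.getKey()));
                extractions.add(pool.submit(() -> {
                    hiddenText.put(e.getKey(), extractText(e.getValue()));
                }));
            }
            for (Future<?> f : extractions) {
                f.get(); // wait for completion; rethrows extraction failures
            }
        } finally {
            pool.shutdown();
        }
        // Phase 2: find docs still marked pending and update them with the saved text.
        for (Doc doc : index.values()) {
            String text = hiddenText.get(doc.path);
            if (doc.fulltextPending && text != null) {
                doc.fulltext = text;
                doc.fulltextPending = false;
            }
        }
        return index;
    }

    public static void main(String[] args) throws Exception {
        Map<String, byte[]> binaries = new LinkedHashMap<>();
        binaries.put("/content/a.txt", " hello oak ".getBytes(StandardCharsets.UTF_8));
        binaries.put("/content/b.txt", "lucene".getBytes(StandardCharsets.UTF_8));
        for (Doc d : index(binaries, 4).values()) {
            System.out.println(d.path + " pending=" + d.fulltextPending + " text=" + d.fulltext);
        }
    }
}
```

In this sketch the second phase runs only after all extractions finish. A real implementation could instead run it later and asynchronously, which is what the pending marker is for.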

This message was sent by Atlassian JIRA
