jackrabbit-oak-issues mailing list archives

From "Chetan Mehrotra (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (OAK-2892) Speed up lucene indexing post migration by pre extracting the text content from binaries
Date Fri, 10 Jul 2015 11:50:04 GMT

    [ https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622175#comment-14622175 ]

Chetan Mehrotra commented on OAK-2892:

Done initial implementation in http://svn.apache.org/r1690247

[~tmueller] Can you review the commit to see whether the comments you made are addressed?
If anything needs to be changed there, let me know.

> Speed up lucene indexing post migration by pre extracting the text content from binaries
> ----------------------------------------------------------------------------------------
>                 Key: OAK-2892
>                 URL: https://issues.apache.org/jira/browse/OAK-2892
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: lucene, run
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>              Labels: performance
>             Fix For: 1.3.3, 1.0.18
> While migrating large repositories, say with 3 M docs (250k PDFs), Lucene indexing takes
> a long time to complete (at times 4 days!). Currently the text extraction logic is coupled
> with Lucene indexing and hence is performed in single-threaded mode, which slows down the
> indexing process. Further, if reindexing has to be triggered, it has to be done all over again.
> To speed up the Lucene indexing we can decouple the text extraction
> from the actual indexing. This is partly based on the discussion in OAK-2787.
> # Introduce a new ExtractedTextProvider which can provide extracted text for a given
> Blob instance
> # In oak-run, introduce a new indexer mode. This would take a path in the repository,
> traverse the repository, look for existing binaries and extract text from them
> So before or after the migration is done, one can run this oak-run tool to create this store,
> which has the text already extracted. Then post startup we need to wire up the
> ExtractedTextProvider instance (which is backed by the BlobStore populated before) and the
> indexing logic can just get the content from that. This would avoid performing expensive
> text extraction in the indexing thread.
> See discussion thread http://markmail.org/thread/ndlfpkwfgpey6o66
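
The decoupling described in the issue can be illustrated with a minimal sketch. It is not the
code from the commit: the Binary class, the in-memory store, the extractor function and all
method names are assumptions made for illustration, and only the name ExtractedTextProvider
comes from the issue text. Text is extracted ahead of time by several threads, and the
indexing path then only looks it up, falling back to inline extraction on a miss.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

public class PreExtractionSketch {

    /** Hypothetical stand-in for a repository binary; this is not Oak's Blob API. */
    public static final class Binary {
        private final String id;
        private final byte[] content;

        public Binary(String id, byte[] content) {
            this.id = id;
            this.content = content;
        }

        public String getId() { return id; }
        public byte[] getContent() { return content; }
    }

    /** Hypothetical provider: returns pre-extracted text for a binary, if present. */
    public interface ExtractedTextProvider {
        Optional<String> getText(Binary binary);
    }

    /** In-memory store keyed by binary id; a real store would be persistent. */
    public static final class InMemoryTextStore implements ExtractedTextProvider {
        private final Map<String, String> textById = new ConcurrentHashMap<>();

        public void put(String id, String text) { textById.put(id, text); }

        @Override
        public Optional<String> getText(Binary binary) {
            return Optional.ofNullable(textById.get(binary.getId()));
        }
    }

    /** Offline pass: extract text from all binaries on a fixed-size thread pool. */
    public static void preExtract(List<Binary> binaries, Function<Binary, String> extractor,
                                  InMemoryTextStore store, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (Binary b : binaries) {
            // Failures inside a task are not surfaced in this sketch.
            pool.submit(() -> store.put(b.getId(), extractor.apply(b)));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("pre-extraction did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pre-extraction was interrupted", e);
        }
    }

    /** Indexing-time lookup: use stored text, extract inline only on a miss. */
    public static String textForIndexing(Binary b, ExtractedTextProvider provider,
                                         Function<Binary, String> extractor) {
        return provider.getText(b).orElseGet(() -> extractor.apply(b));
    }

    public static void main(String[] args) {
        // Placeholder extractor: decodes bytes as UTF-8 instead of parsing a PDF.
        Function<Binary, String> extractor =
                b -> new String(b.getContent(), StandardCharsets.UTF_8);
        Binary doc = new Binary("doc-1", "hello".getBytes(StandardCharsets.UTF_8));

        InMemoryTextStore store = new InMemoryTextStore();
        preExtract(Arrays.asList(doc), extractor, store, 4);
        System.out.println(textForIndexing(doc, store, extractor));
    }
}
```

Keying the store by binary id means a later reindex can reuse the stored text instead of
running extraction again, which is the cost the issue aims to avoid.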

This message was sent by Atlassian JIRA