jackrabbit-oak-issues mailing list archives

From "Davide Giannella (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (OAK-3092) Cache recently extracted text to avoid duplicate extraction
Date Mon, 03 Aug 2015 11:34:06 GMT

     [ https://issues.apache.org/jira/browse/OAK-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davide Giannella updated OAK-3092:
    Fix Version/s:     (was: 1.3.4)

> Cache recently extracted text to avoid duplicate extraction
> -----------------------------------------------------------
>                 Key: OAK-3092
>                 URL: https://issues.apache.org/jira/browse/OAK-3092
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.3.5
> Text can be extracted from the same binary multiple times in a given indexing cycle. This can happen for two reasons:
> # Multiple Lucene indexes indexing the same node - A system might have multiple Lucene indexes, e.g. a global Lucene index and an index for a specific nodeType. In a given indexing cycle the same file would be picked up by both index definitions, and both would extract the same text.
> # Aggregation - With index-time aggregation the same file gets picked up multiple times due to aggregation rules.
> To avoid the wasted effort of duplicate text extraction from the same file in a given indexing cycle, it would be better to have an expiring cache that holds on to extracted text content for some time. The cache should have the following features:
> # Limit on total size
> # A way to expire the content using [Timed Eviction|https://code.google.com/p/guava-libraries/wiki/CachesExplained#Timed_Eviction] - As the chances of the same file getting picked up again are high only within a given indexing cycle, it would be better to expire the cache entries after some time to avoid hogging memory unnecessarily.
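
The two cache features listed above (a limit on total size and time-based expiry) can be illustrated with a minimal sketch. This is not the implementation attached to the issue. It uses only the Java standard library as a stand-in for the Guava timed eviction linked above, and the names ExtractedTextCache, maxTotalChars, expiryMillis and the injected clock are hypothetical, chosen for illustration. It also assumes the extractor never returns null.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical sketch: size-bounded, expire-after-write cache for extracted text. */
final class ExtractedTextCache {
    private static final class Entry {
        final String text;
        final long writtenAtMillis;

        Entry(String text, long writtenAtMillis) {
            this.text = text;
            this.writtenAtMillis = writtenAtMillis;
        }
    }

    private final long maxTotalChars;   // limit on total size, measured in characters
    private final long expiryMillis;    // entries expire this long after being written
    private final LongSupplier clock;   // injected so that expiry can be tested
    // Keys are never overwritten, so iteration order is write order (oldest first).
    private final LinkedHashMap<String, Entry> entries = new LinkedHashMap<>();
    private long totalChars;

    ExtractedTextCache(long maxTotalChars, long expiryMillis, LongSupplier clock) {
        this.maxTotalChars = maxTotalChars;
        this.expiryMillis = expiryMillis;
        this.clock = clock;
    }

    /** Returns the cached text for blobId, or runs the extractor and caches its result. */
    synchronized String get(String blobId, Function<String, String> extractor) {
        long now = clock.getAsLong();
        evictExpired(now);
        Entry cached = entries.get(blobId);
        if (cached != null) {
            return cached.text;
        }
        String text = extractor.apply(blobId);
        if (text.length() <= maxTotalChars) {   // skip texts that could never fit
            entries.put(blobId, new Entry(text, now));
            totalChars += text.length();
            evictOverflow();
        }
        return text;
    }

    synchronized int size() {
        return entries.size();
    }

    /** Drops entries written at least expiryMillis ago. */
    private void evictExpired(long now) {
        Iterator<Entry> it = entries.values().iterator();
        while (it.hasNext()) {
            Entry e = it.next();
            if (now - e.writtenAtMillis < expiryMillis) {
                break;   // all remaining entries were written later
            }
            totalChars -= e.text.length();
            it.remove();
        }
    }

    /** Drops the oldest entries until the total size is within the limit. */
    private void evictOverflow() {
        Iterator<Entry> it = entries.values().iterator();
        while (totalChars > maxTotalChars && it.hasNext()) {
            totalChars -= it.next().text.length();
            it.remove();
        }
    }
}
```

A caller would key the cache by some stable blob identifier and pass the real extraction routine as the extractor. With Guava on the classpath, a cache built with a maximum weight and an expire-after-write duration would likely be the more natural choice than this hand-rolled version.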

> Such a cache would provide the following benefits:
> # Avoid duplicate text extraction - Text extraction is costly and has to be minimized on the critical path of {{indexEditor}}
> # Avoid expensive IO, especially if binary content has to be fetched from a remote {{BlobStore}}
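
To make the first benefit concrete, the following sketch counts extractions in one simulated indexing cycle in which several index definitions process the same binaries, with and without a shared cache. Everything here is hypothetical: SharedExtractionDemo, extract and runCycle are illustrative names, the extraction is a stub, and a plain HashMap stands in for the cache, so no size limit or expiry is modelled.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical demo: counts text extractions in one simulated indexing cycle. */
final class SharedExtractionDemo {

    /** Stand-in for a costly extraction (parsing plus a possible remote blob fetch). */
    static String extract(String blobId, AtomicInteger extractions) {
        extractions.incrementAndGet();
        return "text of " + blobId;
    }

    /** Every index processes every blob; returns how many extractions were performed. */
    static int runCycle(List<String> indexes, List<String> blobIds, boolean shareCache) {
        AtomicInteger extractions = new AtomicInteger();
        Map<String, String> cache = new HashMap<>();
        for (String index : indexes) {
            for (String blobId : blobIds) {
                String text = shareCache
                        ? cache.computeIfAbsent(blobId, id -> extract(id, extractions))
                        : extract(blobId, extractions);
                // A real index editor would add `text` to the document for `index` here.
            }
        }
        return extractions.get();
    }
}
```

With two index definitions and three binaries, the uncached cycle extracts each binary once per index, while the cached cycle extracts each binary only once.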

