jackrabbit-oak-issues mailing list archives

From "Gardner Buchanan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (OAK-2808) Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC
Date Mon, 02 May 2016 13:44:12 GMT

    [ https://issues.apache.org/jira/browse/OAK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266629#comment-15266629 ]

Gardner Buchanan commented on OAK-2808:

bq. BlobStore which does not treat all binary as equal would certainly be the best solution.

I would advocate an alternate parallel blobstore that uses an approach based on the repository
path, e.g. compute the MD5 of the path and place the blob accordingly. Merely placing index files
within the datastore according to their path rather than their content would immediately alleviate
the bloat problem with indexes simply because the file contents could be overwritten in place.
It might not even be necessary to do anything fancy about garbage collecting these.
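
A minimal sketch of the path-keyed placement described above, assuming a plain filesystem-backed store. The names `blob_location` and `write_blob` and the two-level directory fan-out are illustrative assumptions, not part of Oak's BlobStore API. The point it shows is that a second write for the same repository path lands on the same file, so the old content is replaced rather than left behind as garbage.

```python
import hashlib
import os
import tempfile


def blob_location(root: str, repo_path: str) -> str:
    """Map a repository path to a file location derived from the MD5 of the path."""
    digest = hashlib.md5(repo_path.encode("utf-8")).hexdigest()
    # Hypothetical layout: fan out over two directory levels to keep directories small.
    return os.path.join(root, digest[:2], digest[2:4], digest)


def write_blob(root: str, repo_path: str, data: bytes) -> str:
    """Store data for repo_path, replacing any earlier version at the same location."""
    target = blob_location(root, repo_path)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    # Write to a temporary file first, then rename it over the target,
    # so a reader never observes a half-written blob.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(target))
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    os.replace(tmp, target)
    return target
```

Because the location depends only on the path, a content-addressed store and this path-addressed store could sit side by side, each handling the binaries it suits.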

GC, when it is needed, can take the same pattern as with the content based approach – traverse
the repo, make a list of the paths and their MD5 sums – traverse the blobstore and keep
the items on the list.
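
The GC pattern above can be sketched as a mark-and-sweep over path-derived keys. This is a sketch under stated assumptions: the blobstore is modelled as an in-memory dict keyed by the MD5 of the path, and `live_paths` stands in for whatever a real repository traversal would yield. `path_key` and `collect_garbage` are hypothetical names.

```python
import hashlib


def path_key(repo_path: str) -> str:
    """Key under which the blob for repo_path is stored (MD5 of the path)."""
    return hashlib.md5(repo_path.encode("utf-8")).hexdigest()


def collect_garbage(store: dict, live_paths) -> list:
    """Remove every stored blob whose key matches no live repository path."""
    # Mark: build the list of keys that are still referenced by the repository.
    keep = {path_key(p) for p in live_paths}
    # Sweep: anything in the store that is not on the list is garbage.
    stale = [key for key in store if key not in keep]
    for key in stale:
        del store[key]
    return stale
```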

I would also like to see the choice of blob store implementation made at the repository level,
maybe via a mixin or heritable property. Some other application level functionality could
benefit from cleanup in the same way as index binaries, such as workflow payloads and replication
durbo files. The approach used for indexes should generalize to these other use-cases.
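
One way to picture the heritable-property idea is a lookup that walks from a node up to the root and uses the nearest ancestor's setting. The sketch below is purely illustrative: the `policies` mapping, the store names, and `resolve_blob_store` are assumptions made for the example, not existing Oak configuration.

```python
import posixpath


def resolve_blob_store(policies: dict, repo_path: str, default: str = "content") -> str:
    """Return the blob store named on repo_path or on its nearest ancestor."""
    current = repo_path
    while True:
        if current in policies:
            return policies[current]
        if current in ("/", ""):
            # Reached the root without finding a setting: use the default store.
            return default
        current = posixpath.dirname(current)
```

With a setting on an index root or a replication folder, everything beneath it would pick up the path-keyed store while ordinary content keeps the default.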

> Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC
> ----------------------------------------------------------------------------------------------------
>                 Key: OAK-2808
>                 URL: https://issues.apache.org/jira/browse/OAK-2808
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Thomas Mueller
>              Labels: datastore, performance
>             Fix For: 1.6
>         Attachments: OAK-2808-1.patch, copyonread-stats.png
>
> With storing of Lucene index files within DataStore our usage pattern
> of DataStore has changed between JR2 and Oak.
>
> With JR2 the writes were mostly application based, i.e. if an application
> stores a pdf/image file then that would be stored in DataStore. JR2 by
> default would not write stuff to DataStore. Further, in deployments
> where a large amount of binary content is present, systems tend to
> share the DataStore to avoid duplication of storage. In such cases
> running Blob GC is a non-trivial task as it involves a manual step and
> coordination across multiple deployments. Due to this, systems tend to
> run GC less frequently.
>
> Now with Oak, apart from the application, the Oak system itself *actively*
> uses the DataStore to store the index files for Lucene, and there the
> churn might be much higher, i.e. the frequency of creation and deletion of
> index files is a lot higher. This would accelerate the rate of garbage
> generation and thus put a lot more pressure on the DataStore storage
> requirements.
>
> Discussion thread http://markmail.org/thread/iybd3eq2bh372zrl

This message was sent by Atlassian JIRA
