jackrabbit-oak-issues mailing list archives

From "Davide Giannella (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (OAK-7193) DataStore: API to retrieve statistic (file headers, size estimation)
Date Tue, 04 Jun 2019 15:35:08 GMT

     [ https://issues.apache.org/jira/browse/OAK-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Davide Giannella updated OAK-7193:
    Fix Version/s: 1.16.0

> DataStore: API to retrieve statistic (file headers, size estimation)
> --------------------------------------------------------------------
>                 Key: OAK-7193
>                 URL: https://issues.apache.org/jira/browse/OAK-7193
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: blob
>            Reporter: Thomas Mueller
>            Priority: Major
>             Fix For: 1.14.0, 1.16.0
> Extension of OAK-6254: in addition to retrieving the size, it would be good to retrieve
> the estimated number and total size per file type. A simple (and in my view sufficient)
> solution is to use the first few bytes ("magic numbers"; 2 bytes should be enough) to
> determine the file type. That would make it possible to estimate, for example, the number
> and total size of PDF files, JPEG files, Lucene index files, and so on. A histogram would
> be nice as well, but I think it is not needed.
> To speed up the calculation, the blob ID could be extended with the first 2 bytes of the
> file content, that is: <hash>#<length>@<magic>, where magic is the first two bytes, in hex.
> That would make it possible to get the data quickly from the blob IDs (no need to actually
> read the content).
> Sampling should be enough. The longer the sampling runs, the more accurate the data. We
> could store the data while doing datastore GC, in which case the returned data would be
> somewhat stale; that's OK.
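
The proposal in the description could be sketched roughly as follows. This is only an illustration under stated assumptions, not Oak code: the class BlobTypeStats and its methods are hypothetical, the ID layout <hash>#<length>@<magic> is the one suggested in the issue (it is not an existing Oak format), and the small magic-number table lists only a few well-known two-byte prefixes.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not part of Oak): per-type statistics from extended blob IDs. */
class BlobTypeStats {

    /** Blob count and total byte size for one magic prefix. */
    static final class Stat {
        long count;
        long totalSize;
    }

    /**
     * Parses IDs of the assumed form <hash>#<length>@<magic> and sums count and
     * size per magic value. IDs without a magic suffix are grouped under
     * "unknown"; IDs without a parsable length are skipped.
     */
    static Map<String, Stat> aggregate(Iterable<String> blobIds) {
        Map<String, Stat> stats = new TreeMap<>();
        for (String id : blobIds) {
            int hashPos = id.lastIndexOf('#');
            if (hashPos < 0) {
                continue; // no length part, nothing to count
            }
            int atPos = id.indexOf('@', hashPos);
            String lengthPart = atPos < 0
                    ? id.substring(hashPos + 1)
                    : id.substring(hashPos + 1, atPos);
            String magic = atPos < 0
                    ? "unknown"
                    : id.substring(atPos + 1).toLowerCase(Locale.ROOT);
            long length;
            try {
                length = Long.parseLong(lengthPart);
            } catch (NumberFormatException e) {
                continue; // malformed length
            }
            Stat s = stats.computeIfAbsent(magic, k -> new Stat());
            s.count++;
            s.totalSize += length;
        }
        return stats;
    }

    /** Scales a value measured on a sample up to an estimate for all blobs. */
    static long extrapolate(long sampledValue, long sampleSize, long totalBlobs) {
        if (sampleSize == 0) {
            return 0;
        }
        return Math.round((double) sampledValue * totalBlobs / sampleSize);
    }

    /** Maps a few well-known two-byte prefixes (hex) to a readable type name. */
    static String describe(String magic) {
        switch (magic) {
            case "2550": return "PDF";  // "%P" of "%PDF"
            case "ffd8": return "JPEG"; // JPEG start-of-image marker
            case "8950": return "PNG";  // 0x89 followed by 'P'
            case "504b": return "ZIP";  // "PK"
            default:     return "other (" + magic + ")";
        }
    }

    public static void main(String[] args) {
        // Made-up sample IDs, for illustration only.
        Map<String, Stat> stats = aggregate(Arrays.asList(
                "aa11#100@2550", "bb22#50@2550", "cc33#10@ffd8", "dd44#7"));
        for (Map.Entry<String, Stat> e : stats.entrySet()) {
            Stat s = e.getValue();
            System.out.println(describe(e.getKey()) + ": count=" + s.count
                    + ", totalSize=" + s.totalSize);
        }
    }
}
```

Because everything is derived from the ID strings alone, such a pass never has to open a blob. The extrapolate method reflects the sampling idea: statistics gathered over a subset of IDs are scaled by the ratio of total blobs to sampled blobs, so a larger sample gives a better estimate.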

This message was sent by Atlassian JIRA
