jackrabbit-oak-issues mailing list archives

From "Matt Ryan (JIRA)" <j...@apache.org>
Subject [jira] [Created] (OAK-8421) Add oak-run option to dump extracted text for all binaries
Date Thu, 20 Jun 2019 14:37:00 GMT
Matt Ryan created OAK-8421:
------------------------------

             Summary: Add oak-run option to dump extracted text for all binaries
                 Key: OAK-8421
                 URL: https://issues.apache.org/jira/browse/OAK-8421
             Project: Jackrabbit Oak
          Issue Type: Improvement
          Components: indexing, oak-run
    Affects Versions: 1.14.0
            Reporter: Matt Ryan


When you use {{oak-run}} to dump the extracted text from binary properties, the "generate"
step skips inlined binaries and does not write them to the output CSV file.  As a result, the
"extract" and "populate" steps, which both read this CSV, do not include the extracted text
from those binaries in the dump.

It would be nice to add an option to the "generate" step that tells {{oak-run}} to also include
inlined binaries in the CSV.  For this to work, the "extract" step would also need the node
store parameter so that it can read the binary from the node store when the binary is inlined.
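
A rough sketch of the proposed behavior might look like the following. Everything in it is hypothetical and not taken from {{oak-run}}: the class and method names, the {{includeInlined}} flag, the simplified two-column CSV line, and the assumption that an inlined binary can be recognized by a missing blob id. Plain maps stand in for the blob store and the node store.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from oak-run itself.
class InlinedBinarySketch {

    /** Stand-in for one binary property found while traversing the repository. */
    static final class BinaryEntry {
        final String path;   // path of the node holding the binary
        final String blobId; // blob store id; assumed null when the binary is inlined

        BinaryEntry(String path, String blobId) {
            this.path = path;
            this.blobId = blobId;
        }

        boolean isInlined() {
            return blobId == null;
        }
    }

    /** "generate": one simplified CSV line per binary; inlined ones only on request. */
    static List<String> generate(List<BinaryEntry> entries, boolean includeInlined) {
        List<String> lines = new ArrayList<>();
        for (BinaryEntry e : entries) {
            if (e.isInlined() && !includeInlined) {
                continue; // the behavior described in this issue: inlined binaries are skipped
            }
            // In this sketch an empty first column marks an inlined binary.
            lines.add((e.isInlined() ? "" : e.blobId) + "," + e.path);
        }
        return lines;
    }

    /** "extract": choose the content source for one CSV line. */
    static String resolveContent(String csvLine,
                                 Map<String, String> blobStore,
                                 Map<String, String> nodeStore) {
        int comma = csvLine.indexOf(',');
        String blobId = csvLine.substring(0, comma);
        String path = csvLine.substring(comma + 1);
        // Inlined binaries have no blob id, so they are looked up by path in the node store.
        return blobId.isEmpty() ? nodeStore.get(path) : blobStore.get(blobId);
    }
}
```

The point of the sketch is only that "generate" needs a switch to stop skipping inlined binaries, and that "extract" then needs access to the node store in addition to the blob store.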

I'm not sure about the "populate" step; it might need this too.  It tries to get the text
directly from the index, so it depends on whether the extracted text of inlined binaries is
also stored in the index.  I would assume it is, so the "populate" step may not need to be
modified.

The {{oak-run}} documentation would also need to be updated; specifically this page:  [https://jackrabbit.apache.org/oak/docs/query/pre-extract-text.html]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
