jackrabbit-oak-issues mailing list archives

From "Thomas Mueller (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (OAK-5519) Skip problematic binaries instead of blocking indexing
Date Wed, 08 Nov 2017 14:51:00 GMT

    [ https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244076#comment-16244076 ]

Thomas Mueller commented on OAK-5519:

> Going forward we can probably store some hidden property to mark such binaries to avoid
hitting them again (as cache is ephemeral)

That's what I thought as well, but actually, I think this is not needed. When adding a bad
PDF, text extraction will run and then time out, and the text "TextExtractionError" is
stored in the fulltext index. Indexing continues. The thread will continue to consume 100%
CPU until the process is killed or the thread is stopped. However, after a restart, Oak will
not try to extract the same binary again, as indexing continued. Except if you upload the
same binary to somewhere else, but I guess that's rare.
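
To illustrate the behaviour described above, here is a minimal Java sketch of extraction with a timeout. It is not Oak's actual extraction code: the class name TimedTextExtraction, the method extractOrMarker, and the choice to treat extractor exceptions the same way as timeouts are all assumptions made for the example. Only the marker text "TextExtractionError" comes from the comment itself.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; names and structure are assumptions, not Oak's API.
class TimedTextExtraction {

    // Marker stored in the fulltext index in place of the extracted text.
    static final String ERROR_MARKER = "TextExtractionError";

    // Daemon threads, so a runaway extraction cannot keep this demo JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "text-extraction");
        t.setDaemon(true);
        return t;
    });

    // Runs the extractor on a separate thread and waits at most timeoutMillis.
    static String extractOrMarker(Callable<String> extractor, long timeoutMillis) {
        Future<String> future = POOL.submit(extractor);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: a parser stuck in a CPU-bound loop may ignore
            // the interrupt and keep running until the process is stopped.
            future.cancel(true);
            return ERROR_MARKER;
        } catch (ExecutionException e) {
            // Assumption: an extractor failure is recorded like a timeout.
            return ERROR_MARKER;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return ERROR_MARKER;
        }
    }

    public static void main(String[] args) {
        // A well-behaved extractor returns its text.
        System.out.println(extractOrMarker(() -> "hello world", 1000));
        // A slow extractor is abandoned and the marker is indexed instead.
        System.out.println(extractOrMarker(() -> {
            Thread.sleep(60_000);
            return "never";
        }, 100));
    }
}
```

The caller gets a value to index either way, so indexing can move on to the next item. The sketch does not solve the 100% CPU problem: cancelling the future only interrupts the worker thread, and code that never checks the interrupt flag will keep running.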

> We can possibly store some more data/marker in special field which can then later be
queried to find out all such files which have not been indexed

Well, as you wrote, using the following query I can get the list of binaries where extraction failed:
/jcr:root//*[jcr:contains(., 'textextractionerror')] 

Of course this includes binaries that contain this exact term, but I don't think that's a
big problem.

> Skip problematic binaries instead of blocking indexing
> ------------------------------------------------------
>                 Key: OAK-5519
>                 URL: https://issues.apache.org/jira/browse/OAK-5519
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: indexing
>            Reporter: Alexander Klimetschek
>            Assignee: Thomas Mueller
>              Labels: resilience
>             Fix For: 1.8
> If a text extraction is blocked (weird PDF) or a blob cannot be found in the datastore
or any other error upon indexing one item from the repository that is outside the scope of
the indexer, it currently halts the indexing (lane). Thus one item (that maybe isn't important
to the users at all) can block the indexing of other, new content (that might be important
to users), and it always requires manual intervention (which is also not easy and requires
Oak experts).
> Instead, the item could be remembered in a known issue list, proper warnings given, and
indexing continue. Maintenance operations should be available to come back to reindex these,
or the indexer could automatically retry after some time. This would allow normal user activity
to go on without manual intervention, and solving the problem (if it's isolated to some binaries)
can be deferred.
> I think the line should probably be drawn for binary properties. Not sure if other JCR
property types could trigger a similar issue, and if a failure in them might actually warrant
a halt, as it could lead to an "incorrect" index, if these properties are important. But maybe
the line is simply a try & catch around "full text extraction".
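
The "known issue list" idea from the description can be sketched in Java as a try/catch around the extraction of each item. This is an illustration under assumptions, not the Oak indexer: the class SkippingIndexerSketch, its fields, and the in-memory maps standing in for the repository and the index are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; an in-memory stand-in for an indexing lane.
class SkippingIndexerSketch {

    // path -> extracted text, standing in for the fulltext index
    final Map<String, String> index = new LinkedHashMap<>();

    // paths whose extraction failed, kept for a later retry or reindex
    final List<String> knownIssues = new ArrayList<>();

    void indexAll(Map<String, byte[]> binaries, Function<byte[], String> extractor) {
        for (Map.Entry<String, byte[]> entry : binaries.entrySet()) {
            try {
                index.put(entry.getKey(), extractor.apply(entry.getValue()));
            } catch (RuntimeException ex) {
                // Remember the item, warn, and carry on with the next one
                // instead of halting the whole lane.
                knownIssues.add(entry.getKey());
                System.err.println("WARN: skipping " + entry.getKey() + ": " + ex);
            }
        }
    }
}
```

A maintenance task could later walk knownIssues and retry those paths. The catch is deliberately limited to the extraction call, in line with the suggestion to draw the line at full text extraction for binaries.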

This message was sent by Atlassian JIRA
