jackrabbit-oak-issues mailing list archives

From "Thomas Mueller (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (OAK-5519) Skip problematic binaries instead of blocking indexing
Date Thu, 09 Nov 2017 17:08:01 GMT

    [ https://issues.apache.org/jira/browse/OAK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246063#comment-16246063 ]

Thomas Mueller edited comment on OAK-5519 at 11/9/17 5:07 PM:
--------------------------------------------------------------

http://svn.apache.org/r1814745

[~chetanm] I have incorporated your requests. Features:
* No OSGi / JMX configuration right now, but "emergency" configuration via system properties
(for example, ability to disable this feature, set timeout,...)
* Timeout is 60 seconds.
* Timed-out extractions are now recorded in a properties file named "textExtractionTimeout.properties"
in the repository / index directory. Example content below. This file is read on startup and kept
fully in memory, so this mechanism is not suited to large files.
{noformat}
#Text extraction timed out for the following binaries, and will not be retried
#Thu Nov 09 12:33:52 CET 2017
405dfb76526462a6268f1aacb09359179216df423c474b3a1f578b9c567faa35\#190148=TextExtractionError
d19a28de09b655dbe099ee9e72e5bc782088994cca054062213d80b22f2ac67f\#1757777=TextExtractionError
251c6082691578dc1aff306a59984e1b80a79befd8465e158335c5cbfe8bb596\#399142=TextExtractionError
{noformat}
* Failed extraction is cached.
* The number of extractions that timed out can be read via JMX (TextExtractionStatsMBean.getTimeoutCount).
Each timed-out extraction thread can consume 100% CPU (unless it stops at some point).
* Extraction uses its own executor service with daemon threads. The executor is shut down when stopping
the service, and restarted when needed. Usually just one thread, up to 10 (configurable),
so worst case up to 900% CPU usage if 9 extractions time out.
* Thread name is "oak binary text extractor" plus the name of the extracted blob (similar
to what it was before).
* Only binaries larger than 16 KB are extracted in a separate thread.
* A warning is logged if extraction times out.
* No change for OutOfMemory and so on (Throwable was already caught before this patch). So
this patch only affects timeouts.
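
The executor and timeout behaviour listed above can be illustrated with a minimal sketch. This is not the code from r1814745: the class name TimeoutExtractionSketch and its methods are hypothetical, and a plain fixed thread pool stands in for whatever pool configuration the patch uses (so the "just one thread usually" behaviour is not modelled).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; names do not come from Oak.
public class TimeoutExtractionSketch {
    private final ExecutorService executor;
    private final AtomicInteger timeoutCount = new AtomicInteger();

    public TimeoutExtractionSketch(int maxThreads) {
        // Daemon threads, so a stuck extraction cannot keep the JVM alive.
        executor = Executors.newFixedThreadPool(maxThreads, r -> {
            Thread t = new Thread(r, "oak binary text extractor");
            t.setDaemon(true);
            return t;
        });
    }

    // Runs the extraction in the pool; returns null on timeout or failure.
    public String extract(Callable<String> extraction, long timeout, TimeUnit unit) {
        Future<String> future = executor.submit(extraction);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            timeoutCount.incrementAndGet();
            // Only requests an interrupt; code that ignores interrupts keeps running.
            future.cancel(true);
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            return null;
        }
    }

    public int getTimeoutCount() {
        return timeoutCount.get();
    }

    public void shutdown() {
        executor.shutdownNow();
    }
}
```

Because Future.cancel(true) can only interrupt the worker, an extraction that never checks for interruption continues to occupy its thread, which is consistent with the CPU usage caveat in the list above.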


was (Author: tmueller):
http://svn.apache.org/r1814745

[~chetanm] I have incorporated your requests. Features:
* No OSGi / JMX configuration right now, but "emergency" configuration via system properties
(for example, ability to disable this feature, set timeout,...)
* Timeout is 60 seconds.
* Timed out extraction is now stored to a file in the repository / index directory, in a properties
file named "textExtractionTimeout.properties". Example content below. This file is read on
startup.
{noformat}
#Text extraction timed out for the following binaries, and will not be retried
#Thu Nov 09 12:33:52 CET 2017
405dfb76526462a6268f1aacb09359179216df423c474b3a1f578b9c567faa35\#190148=TextExtractionError
d19a28de09b655dbe099ee9e72e5bc782088994cca054062213d80b22f2ac67f\#1757777=TextExtractionError
251c6082691578dc1aff306a59984e1b80a79befd8465e158335c5cbfe8bb596\#399142=TextExtractionError
{noformat}
* Failed extraction is cached.
* Number of extractions that timed out can be read via JMX (TextExtractionStatsMBean.getTimeoutCount).
Each of those threads can consume 100% CPU (unless they stop at some point).
* It is using its own executor service with daemon threads. This is shut down when stopping
the service, and restarted when needed. Just one thread usually, up to 10 (configurable),
so worst case up to 900% CPU usage if 9 extractions time out. 
* Thread name is "oak binary text extractor" plus the name of the extracted blob (similar
to what it was before).
* Only binaries larger than 16 KB are extracted in a separate thread.
* A warning is logged if extraction times out.
* No change for OutOfMemory and so on (Throwable was already caught before this patch). So
this patch only affects timeouts.

> Skip problematic binaries instead of blocking indexing
> ------------------------------------------------------
>
>                 Key: OAK-5519
>                 URL: https://issues.apache.org/jira/browse/OAK-5519
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: indexing
>            Reporter: Alexander Klimetschek
>            Assignee: Thomas Mueller
>              Labels: resilience
>             Fix For: 1.8, 1.7.12
>
>
> If a text extraction is blocked (weird PDF), a blob cannot be found in the datastore,
or any other error outside the scope of the indexer occurs while indexing one item from the repository,
it currently halts the indexing (lane). Thus one item (that maybe isn't important
to the users at all) can block the indexing of other, new content (that might be important
to users), and it always requires manual intervention (which is also not easy and requires
Oak experts).
> Instead, the item could be remembered in a known issue list, proper warnings given, and
indexing continue. Maintenance operations should be available to come back to reindex these,
or the indexer could automatically retry after some time. This would allow normal user activity
to go on without manual intervention, and solving the problem (if it's isolated to some binaries)
can be deferred.
> I think the line should probably be drawn for binary properties. Not sure if other JCR
property types could trigger a similar issue, and if a failure in them might actually warrant
a halt, as it could lead to an "incorrect" index, if these properties are important. But maybe
the line is simply a try & catch around "full text extraction".
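
The "known issue list" idea corresponds to the "textExtractionTimeout.properties" file shown in the comment. A minimal sketch of reading such a file and checking whether a binary should be skipped follows. The class and method names are hypothetical, and treating the key as a blob identifier is an assumption based on the example content. The file format itself is the standard java.util.Properties format, in which "\#" inside a key is an escaped "#".

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical illustration only; names do not come from Oak.
public class TimeoutListSketch {

    // Parses properties-file content; load() unescapes "\#" in keys to "#".
    static Properties load(String content) throws IOException {
        Properties p = new Properties();
        p.load(new StringReader(content));
        return p;
    }

    // Assumption: a binary is skipped if its identifier is a key in the file.
    static boolean shouldSkip(Properties knownIssues, String blobId) {
        return knownIssues.containsKey(blobId);
    }
}
```

With the example content from the comment, the first entry would be looked up under the key "405dfb76526462a6268f1aacb09359179216df423c474b3a1f578b9c567faa35#190148", without the backslash.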



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
