tika-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-456) Support timeouts for parsers
Date Tue, 06 Jul 2010 12:13:50 GMT

    [ https://issues.apache.org/jira/browse/TIKA-456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885514#action_12885514

Andrzej Bialecki  commented on TIKA-456:

Yes, if it's optional functionality ... I think there are two use cases here:

* a common case is that a user application needs a guarantee that the parsing process
won't take longer than X, whatever the reason. Once the interval X has passed, the results
don't matter and the app needs to move on to other work. This may include situations such as
a slow network or an overloaded DB, and the parsing should time out even if it's not the Tika's

* the second case is when the application can tolerate a long parsing time as long as it can ensure
that there is still _some_ progress ... if there is a lack of progress then the parsing should

The first case can be handled by solutions described above. The second case can be handled
by a watchdog flag that the application can periodically check. The flag (or counter) could
be incremented by any SAX event, and reset by checking the flag from the application, or there
could be a background thread in Tika that checks this.
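
A minimal sketch of the watchdog counter described above, under stated assumptions: the class
name ProgressWatchdogHandler and the method checkAndReset() are hypothetical and not part of
Tika. The handler only counts SAX events; forwarding them to a real ContentHandler is left out
to keep the example short.

```java
import java.util.concurrent.atomic.AtomicLong;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not a Tika class): counts SAX events so that the
// application, or a background thread, can detect a parse that has stalled.
class ProgressWatchdogHandler extends DefaultHandler {
    // Incremented by the parsing thread, read and reset by the watcher.
    private final AtomicLong events = new AtomicLong();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.incrementAndGet();
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.incrementAndGet();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        events.incrementAndGet();
    }

    /** Returns the number of events seen since the previous check and resets the counter. */
    long checkAndReset() {
        return events.getAndSet(0);
    }
}
```

The application would call checkAndReset() at its chosen interval; a return value of 0 means
no SAX event arrived since the previous check, which it could treat as a lack of progress.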

> Support timeouts for parsers
> ----------------------------
>                 Key: TIKA-456
>                 URL: https://issues.apache.org/jira/browse/TIKA-456
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Ken Krugler
>            Assignee: Chris A. Mattmann
> There are a number of reasons why Tika could hang while parsing. One common case is when
> a parser is fed an incomplete document, such as what happens when limiting the amount of data
> fetched during a web crawl.
> One solution is to create a TikaCallable that wraps the Tika parser, and then use this
> with a FutureTask. For example, when using a ParsedDatum POJO for the results of the parse
> operation, I do something like this:
>     parser = new AutoDetectParser();
>     Callable<ParsedDatum> c = new TikaCallable(parser, contenthandler, inputstream,
>     FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
>     Thread t = new Thread(task);
>     t.start();
>     ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);
> And TikaCallable() looks like:
> class TikaCallable implements Callable<ParsedDatum> {
>     public TikaCallable(Parser parser, ContentHandler handler, InputStream is, Metadata metadata) {
>         _parser = parser;
>         _handler = handler;
>         _input = is;
>         _metadata = metadata;
>         ...
>     }
>     public ParsedDatum call() throws Exception {
>         ....
>         _parser.parse(_input, _handler, _metadata, new ParseContext());
>         ....
>     }
> }
> This seems like it would be generally useful, as I doubt that we'd ever be able to guarantee
> that none of the parsers being wrapped by Tika could ever hang.
> One idea is to create a TimeoutParser that wraps a regular Tika Parser. E.g. something like:
>   Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);
> Then the call to p.parse(...) would create a Callable (similar to the code above) and
> use the specified timeout when calling task.get().
> One drawback of this approach is that it creates a new thread for each parse request,
> but I don't think the thread overhead is significant when compared to the typical parser operation.
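
The timeout wrapper proposed in the quoted description could be sketched roughly as follows.
This is an illustration under stated assumptions, not the proposed TimeoutParser itself: the
class TimedRunner is hypothetical, it wraps any Callable rather than a Tika Parser so that it
runs without Tika on the classpath, and it submits work to a caller-supplied ExecutorService
instead of starting a new thread per parse.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not a Tika class): runs a task with an upper bound on
// how long the caller is willing to wait for the result.
class TimedRunner {
    private final ExecutorService pool;
    private final long timeout;
    private final TimeUnit unit;

    TimedRunner(ExecutorService pool, long timeout, TimeUnit unit) {
        this.pool = pool;
        this.timeout = timeout;
        this.unit = unit;
    }

    <T> T run(Callable<T> work)
            throws InterruptedException, ExecutionException, TimeoutException {
        // Reuse pooled threads rather than creating one per request.
        Future<T> future = pool.submit(work);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker thread; code that
            // ignores interruption keeps running in the background.
            future.cancel(true);
            throw e;
        }
    }
}
```

In the setting of the issue, the Callable passed to run() would be something like the
TikaCallable shown in the description. Note that the caller stops waiting after the timeout,
but a parser that never checks for interruption can still occupy its thread.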
