tika-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] Created: (TIKA-456) Support timeouts for parsers
Date Mon, 05 Jul 2010 20:42:51 GMT
Support timeouts for parsers

                 Key: TIKA-456
                 URL: https://issues.apache.org/jira/browse/TIKA-456
             Project: Tika
          Issue Type: Improvement
            Reporter: Ken Krugler
            Assignee: Chris A. Mattmann

Tika can hang while parsing for a number of reasons. One common case is when a parser is
fed an incomplete document, as happens when a web crawl limits the amount of data it
fetches.

One solution is to create a TikaCallable that wraps the Tika parser, and then use this with
a FutureTask. For example, when using a ParsedDatum POJO for the results of the parse operation,
I do something like this:

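    Parser parser = new AutoDetectParser();
    Callable<ParsedDatum> c = new TikaCallable(parser, contenthandler, inputstream, metadata);
    FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
    Thread t = new Thread(task);
    t.start();  // without this the task never runs and get() always times out

    // Throws TimeoutException if the parse does not finish in time.
    ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);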

And TikaCallable looks like:

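class TikaCallable implements Callable<ParsedDatum> {
    private final Parser _parser;
    private final ContentHandler _handler;
    private final InputStream _input;
    private final Metadata _metadata;

    public TikaCallable(Parser parser, ContentHandler handler, InputStream is, Metadata metadata) {
        _parser = parser;
        _handler = handler;
        _input = is;
        _metadata = metadata;
    }

    public ParsedDatum call() throws Exception {
        _parser.parse(_input, _handler, _metadata, new ParseContext());
        // Building and returning the ParsedDatum from _handler and _metadata is elided here.
    }
}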

This seems like it would be generally useful, as I doubt that we'd ever be able to guarantee
that none of the parsers being wrapped by Tika could ever hang.

One idea is to create a TimeoutParser that wraps a regular Tika Parser. E.g. something like:

  Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);

Then the call to p.parse(...) would create a Callable (similar to the code above) and use
the specified timeout when calling task.get().
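
The wrapper idea could be sketched roughly as follows. To keep the example compilable without
Tika on the classpath, it uses a hypothetical SimpleParser interface as a stand-in for Tika's
Parser; the interface, its String-based parse() signature, and the thread name are assumptions
made for illustration, not part of Tika.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for Tika's Parser, so the sketch compiles on its own.
interface SimpleParser {
    String parse(String input) throws Exception;
}

// Decorator that bounds the duration of each parse() call on the wrapped parser.
class TimeoutParser implements SimpleParser {
    private final SimpleParser delegate;
    private final long timeout;
    private final TimeUnit unit;

    TimeoutParser(SimpleParser delegate, long timeout, TimeUnit unit) {
        this.delegate = delegate;
        this.timeout = timeout;
        this.unit = unit;
    }

    @Override
    public String parse(final String input) throws Exception {
        Callable<String> call = () -> delegate.parse(input);
        FutureTask<String> task = new FutureTask<>(call);

        // One new thread per parse request, as described above.
        Thread t = new Thread(task, "timeout-parser");
        t.setDaemon(true); // a hung parse should not keep the JVM alive
        t.start();

        try {
            return task.get(timeout, unit);
        } catch (TimeoutException e) {
            // Interrupts the worker; this only stops parsers that honor interruption.
            task.cancel(true);
            throw e;
        } catch (ExecutionException e) {
            // Unwrap so callers see the parser's own exception where possible.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        }
    }
}
```

Note that the timeout only frees the caller. A parser stuck in a loop that ignores interrupts
keeps running on its thread after the TimeoutException is thrown.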

One downside of this approach is that it creates a new thread for each parse request, but I
don't think the thread overhead is significant compared to the cost of a typical parse operation.
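
If the per-request thread ever did matter, one alternative would be to submit the parse tasks
to a shared thread pool. The sketch below is an assumption about how that might look, not
something proposed in this issue; PooledTimeoutRunner, its method names, and the thread name
are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs tasks on a shared pool and bounds how long the caller waits.
class PooledTimeoutRunner implements AutoCloseable {
    private final ExecutorService pool;

    PooledTimeoutRunner(int threads) {
        // Daemon threads, so a hung task cannot block JVM shutdown.
        this.pool = Executors.newFixedThreadPool(threads, r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true);
            return t;
        });
    }

    <T> T run(Callable<T> task, long timeout, TimeUnit unit) throws Exception {
        Future<T> future = pool.submit(task);
        try {
            // The clock starts here, so time spent queued counts against the timeout.
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Frees the worker only if the task responds to interruption.
            future.cancel(true);
            throw e;
        }
    }

    @Override
    public void close() {
        pool.shutdownNow();
    }
}
```

The trade-off is that a parser which hangs and ignores interruption occupies a pool thread
indefinitely, so enough hung parses could exhaust a fixed-size pool. The thread-per-request
approach does not have that failure mode.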

