tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Updated: (TIKA-456) Support timeouts for parsers
Date Thu, 10 Feb 2011 12:54:57 GMT

     [ https://issues.apache.org/jira/browse/TIKA-456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-456:
-------------------------------

    Fix Version/s:     (was: 0.9)

Unscheduling until we have a good idea on how to implement this in practice.

> Support timeouts for parsers
> ----------------------------
>
>                 Key: TIKA-456
>                 URL: https://issues.apache.org/jira/browse/TIKA-456
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Ken Krugler
>            Assignee: Chris A. Mattmann
>
> There are a number of reasons why Tika could hang while parsing. One common case is when a parser is fed an incomplete document, such as what happens when limiting the amount of data fetched during a web crawl.
> One solution is to create a TikaCallable that wraps the Tika parser, and then use this with a FutureTask. For example, when using a ParsedDatum POJO for the results of the parse operation, I do something like this:
>     parser = new AutoDetectParser();
>     Callable<ParsedDatum> c = new TikaCallable(parser, contenthandler, inputstream, metadata);
>     FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
>     Thread t = new Thread(task);
>     t.start();
>     ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);
> And TikaCallable() looks like:
> class TikaCallable implements Callable<ParsedDatum> {
>     public TikaCallable(Parser parser, ContentHandler handler, InputStream is, Metadata metadata) {
>         _parser = parser;
>         _handler = handler;
>         _input = is;
>         _metadata = metadata;
>         ...
>     }
>     public ParsedDatum call() throws Exception {
>         ....
>         _parser.parse(_input, _handler, _metadata, new ParseContext());
>         ....
>     }
> }
> This seems like it would be generally useful, as I doubt that we'd ever be able to guarantee that none of the parsers being wrapped by Tika could ever hang.
> One idea is to create a TimeoutParser that wraps a regular Tika Parser. E.g. something like:
>   Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);
> Then the call to p.parse(...) would create a Callable (similar to the code above) and use the specified timeout when calling task.get().
> One drawback of this approach is that it creates a new thread for each parse request, but I don't think the thread overhead is significant when compared to the typical parser operation.
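
For illustration only, here is a minimal sketch of the timeout mechanism described in the issue, written against java.util.concurrent alone so that it runs without Tika on the classpath. The names TimeLimitedRunner, run and shutdown are hypothetical and not part of any Tika API; the lambdas stand in for a call such as parser.parse(...). As one possible variation on the thread-per-request approach, it assumes a small shared thread pool instead of a new Thread per parse.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs a task on a shared pool and gives up after a timeout. */
class TimeLimitedRunner {
    private final ExecutorService pool;
    private final long timeout;
    private final TimeUnit unit;

    TimeLimitedRunner(int threads, long timeout, TimeUnit unit) {
        // Daemon threads, so a task that never returns cannot keep the JVM alive.
        this.pool = Executors.newFixedThreadPool(threads, r -> {
            Thread t = new Thread(r, "time-limited-task");
            t.setDaemon(true);
            return t;
        });
        this.timeout = timeout;
        this.unit = unit;
    }

    /** Runs the task; throws TimeoutException if it does not finish in time. */
    <T> T run(Callable<T> task) throws Exception {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Interrupts the worker thread; this only stops tasks that react to interruption.
            future.cancel(true);
            throw e;
        } catch (ExecutionException e) {
            // Rethrow the task's own exception rather than the wrapper where possible.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) throws Exception {
        TimeLimitedRunner runner = new TimeLimitedRunner(2, 200, TimeUnit.MILLISECONDS);

        // Stand-in for a parse that completes quickly.
        String ok = runner.run(() -> "parsed");
        System.out.println(ok);

        // Stand-in for a parse that hangs (here it merely sleeps).
        try {
            runner.run(() -> {
                Thread.sleep(5_000);
                return "never";
            });
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
        runner.shutdown();
    }
}
```

Note that Future.cancel(true) only interrupts the worker thread. A task that is stuck in a loop and never checks for interruption keeps running and keeps occupying a pool thread, so a timeout of this kind bounds how long the caller waits, not how long the work itself runs.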

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
