tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Info from parser on handling partial input
Date Sat, 10 Oct 2009 14:15:07 GMT
Hi Jukka,

> On Thu, Oct 8, 2009 at 6:52 PM, Ken Krugler <kkrugler@transpac.com> wrote:
>> I just ran into a problem where a truncated zip file is causing the
>> ZipParser to hang.
> Does it hang (i.e. never return), or throw a TikaException? The former
> would be a clear bug, the latter expected behaviour given that the
> file cannot be parsed.

It hung.

>> The file was truncated because I'd configured Bixo to only fetch the
>> first 65K of a file, to avoid problems caused by huge files.
>> This is common practice for web crawlers, but it means that I need to
>> know which parsers can handle truncated content.
> There's a somewhat related feature request TIKA-261, that approaches
> this issue from a slightly different angle.

>> E.g. text is fine, HTML seems to be OK (based on my prior Nutch
>> experience with NekoHTML).
>> XML is not fine, from what I've seen - the parser will fail if it runs
>> into the end of document before finishing the parse.
> If the truncated stream ends with a -1 return from read(), then I
> would expect the XML parser to throw a TikaException to signify a
> parse failure. If the stream throws an IOException to signify
> truncation, then the parser should propagate that exception up to the
> caller.
> The latter behavior suggests a way to cleanly implement the feature
> you're asking for. The given input stream could be wrapped into a
> decorator that throws a tagged IOException when the given size limit
> has been reached. A parser can capture such exceptions and cleanly
> close the emitted XHTML stream, potentially adding a metadata entry
> that signifies that the extracted text has been truncated.

Interesting idea. I'll need to capture a bunch of truncated zip files
to test.
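
A minimal sketch of the decorator idea quoted above, to make the proposal concrete. The class names SizeLimitedInputStream, SizeLimitReachedException and TruncationDemo are hypothetical and are not taken from Tika; the extract() method only stands in for a parser and does not use any Tika API. It assumes that reaching the limit is itself the truncation signal, so a source of exactly the limit size is also reported as truncated.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical "tagged" exception: the size limit, not the source, ended the read. */
class SizeLimitReachedException extends IOException {
    SizeLimitReachedException(long limit) {
        super("Size limit of " + limit + " bytes reached");
    }
}

/** Hypothetical decorator: passes through at most 'limit' bytes, then throws. */
class SizeLimitedInputStream extends FilterInputStream {
    private final long limit;
    private long count = 0;

    SizeLimitedInputStream(InputStream in, long limit) {
        super(in);
        this.limit = limit;
    }

    @Override
    public int read() throws IOException {
        if (count >= limit) throw new SizeLimitReachedException(limit);
        int b = super.read();
        if (b != -1) count++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        if (count >= limit) throw new SizeLimitReachedException(limit);
        // Never hand out more than the remaining budget.
        int n = super.read(buf, off, (int) Math.min(len, limit - count));
        if (n > 0) count += n;
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        // Skipped bytes count against the limit too.
        long skipped = super.skip(Math.min(n, limit - count));
        count += skipped;
        return skipped;
    }

    @Override
    public boolean markSupported() {
        return false; // reset() would otherwise invalidate the byte count
    }
}

final class TruncationDemo {
    /** Stand-in for a parser: collects text, returns true if the limit cut it short. */
    static boolean extract(InputStream in, StringBuilder text) {
        try {
            int b;
            while ((b = in.read()) != -1) text.append((char) b);
            return false; // clean end of stream
        } catch (SizeLimitReachedException e) {
            // A real parser would close its XHTML output here and
            // record a "truncated" metadata entry.
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e); // unrelated I/O failure
        }
    }

    public static void main(String[] args) {
        byte[] data = "hello world".getBytes(StandardCharsets.US_ASCII);
        StringBuilder text = new StringBuilder();
        boolean truncated =
            extract(new SizeLimitedInputStream(new ByteArrayInputStream(data), 5), text);
        System.out.println("truncated=" + truncated + " text=" + text);
    }
}
```

Because the exception is a distinct subclass of IOException, callers that do not know about it still see an ordinary I/O failure, while a parser that does know about it can catch it separately and finish its output cleanly.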

I filed https://issues.apache.org/jira/browse/TIKA-307 to capture this.

-- Ken
