tika-dev mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject Re: Info from parser on handling partial input
Date Fri, 09 Oct 2009 09:00:00 GMT

On Thu, Oct 8, 2009 at 6:52 PM, Ken Krugler <kkrugler@transpac.com> wrote:
> I just ran into a problem where a truncated zip file is causing the
> ZipParser to hang.

Does it hang (i.e. never return), or throw a TikaException? The former
would be a clear bug, the latter expected behaviour given that the
file cannot be parsed.

> The file was truncated because I'd configured Bixo to only fetch the first
> 65K of a file, to avoid problems caused by huge files.
> This is common practice for web crawlers, but it means that I need to know
> which parsers can handle truncated content.

There's a somewhat related feature request, TIKA-261, which approaches
this issue from a slightly different angle.

> E.g. text is fine, HTML seems to be OK (based on my prior Nutch experience
> with NekoHTML).
> XML is not fine, from what I've seen - the parser will fail if it runs into
> the end of document before finishing the parse.

If the truncated stream ends with a -1 return from read(), then I
would expect the XML parser to throw a TikaException to signify a
parse failure. If the stream throws an IOException to signify
truncation, then the parser should propagate that exception up to the
caller.

The latter behavior suggests a way to cleanly implement the feature
you're asking for. The given input stream could be wrapped into a
decorator that throws a tagged IOException when the given size limit
has been reached. A parser can capture such exceptions and cleanly
close the emitted XHTML stream, potentially adding a metadata entry
that signifies that the extracted text has been truncated.
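
As a rough illustration of the decorator idea, here is a minimal sketch
in plain Java. The names SizeLimitExceededException,
SizeLimitedInputStream and TruncationDemo are hypothetical and not part
of Tika. The demo method stands in for a parser: it only collects
characters and appends a marker where a real parser would close its
XHTML output and set a metadata entry. Note that this sketch throws as
soon as a read is attempted at the limit, so an input of exactly the
limit size is also reported as truncated.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical tagged exception: the size limit, not a real I/O error, ended the read. */
class SizeLimitExceededException extends IOException {
    SizeLimitExceededException(long limit) {
        super("Input exceeded the limit of " + limit + " bytes");
    }
}

/** Hypothetical decorator: passes through up to `limit` bytes, then throws the tagged exception. */
class SizeLimitedInputStream extends FilterInputStream {
    private final long limit;
    private long count = 0;

    SizeLimitedInputStream(InputStream in, long limit) {
        super(in);
        this.limit = limit;
    }

    @Override
    public int read() throws IOException {
        if (count >= limit) {
            throw new SizeLimitExceededException(limit);
        }
        int b = super.read();
        if (b != -1) {
            count++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (count >= limit) {
            throw new SizeLimitExceededException(limit);
        }
        // Never hand out more bytes than the remaining budget.
        int max = (int) Math.min(len, limit - count);
        int n = super.read(buf, off, max);
        if (n > 0) {
            count += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        // Skipped bytes count against the limit as well.
        long max = Math.max(0, Math.min(n, limit - count));
        long skipped = super.skip(max);
        count += skipped;
        return skipped;
    }

    @Override
    public boolean markSupported() {
        return false; // mark/reset would invalidate the byte count
    }
}

class TruncationDemo {
    /** Stand-in for a parser: reads until EOF or until the limit is hit. */
    static String readUpToLimit(byte[] data, long limit) {
        StringBuilder text = new StringBuilder();
        try (InputStream in = new SizeLimitedInputStream(new ByteArrayInputStream(data), limit)) {
            int c;
            while ((c = in.read()) != -1) {
                text.append((char) c);
            }
        } catch (SizeLimitExceededException e) {
            // A real parser would close the XHTML stream and flag the metadata here.
            text.append(" [truncated]");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

Because the tagged exception is a subclass of IOException, parsers that
know nothing about it would still just propagate it, while parsers that
can cope with partial input can catch it specifically.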

> And binary formats like zip, pdf, etc are definitely not OK with truncation.
> So it seems like I'd want to have a parser call that returns back info about
> whether the parser can handle truncated content - e.g.
> boolean truncatedOK(MimeType inputType);

A somewhat related issue is TIKA-153, which asks for a way to pass full
files or memory buffers to a parser. A truncatedOK() method would
essentially tell whether a parser would benefit from having such access
to the complete input document.
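
For illustration only, a truncatedOK() lookup could be sketched as
below. The class name TruncationSupport is hypothetical, the method
takes a plain String rather than Tika's MimeType to keep the example
self-contained, and the set of tolerant types simply mirrors the
observations quoted above (text and HTML are fine; XML, zip and PDF are
not). It is not an existing Tika API.

```java
import java.util.Set;

/** Hypothetical helper, not part of Tika. */
class TruncationSupport {
    // Types reported in this thread as tolerating truncated input.
    // Anything not listed is conservatively treated as not tolerant.
    private static final Set<String> TOLERANT = Set.of("text/plain", "text/html");

    static boolean truncatedOK(String mimeType) {
        return mimeType != null && TOLERANT.contains(mimeType);
    }
}
```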


Jukka Zitting
