tika-dev mailing list archives

From Ken Krugler <kkrug...@transpac.com>
Subject Info from parser on handling partial input
Date Thu, 08 Oct 2009 16:52:15 GMT
Hi all,

I just ran into a problem where a truncated zip file is causing the  
ZipParser to hang.

The file was truncated because I'd configured Bixo to only fetch the  
first 65K of a file, to avoid problems caused by huge files.

This is common practice for web crawlers, but it means that I need to  
know which parsers can handle truncated content.

E.g. text is fine, HTML seems to be OK (based on my prior Nutch  
experience with NekoHTML).

XML is not fine, from what I've seen - the parser will fail if it hits
the end of the document before finishing the parse.

And binary formats like zip, PDF, etc. are definitely not OK with
truncation.

So it seems like I'd want a parser call that reports whether the
parser can handle truncated content - e.g.

boolean truncatedOK(MimeType inputType);

As a stop-gap, I could assume that non-XML text was OK, and everything  
else was no-go for truncated content.
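
A minimal sketch of that stop-gap heuristic, for illustration only. The
class name TruncationPolicy is hypothetical, and the method takes a plain
String rather than a MimeType so that the example has no dependencies.
It is not an existing Tika API. It treats any "text/*" type as safe
unless it is XML, and treats everything else as unsafe:

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika: encodes the stop-gap rule
// "non-XML text is OK when truncated, everything else is not".
final class TruncationPolicy {
    private TruncationPolicy() {}

    static boolean truncatedOK(String mimeType) {
        if (mimeType == null) {
            return false; // unknown type: assume truncation is unsafe
        }
        // Drop parameters such as "; charset=utf-8" and normalize case.
        int semi = mimeType.indexOf(';');
        String base = (semi >= 0 ? mimeType.substring(0, semi) : mimeType)
                .trim()
                .toLowerCase(Locale.ROOT);

        // Anything that is not text/* (zip, PDF, ...) is treated as unsafe.
        if (!base.startsWith("text/")) {
            return false;
        }
        // XML served under a text/* type still needs the whole document.
        return !(base.equals("text/xml") || base.endsWith("+xml"));
    }
}
```

A crawler could call TruncationPolicy.truncatedOK(contentType) before
handing a truncated fetch to a parser, and skip the parse when it
returns false.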

Thoughts on this?

Thanks,

-- Ken

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378

