tika-dev mailing list archives

From Hanssens Bart <Bart.Hanss...@fedict.be>
Subject RE: [bulk] Info from parser on handling partial input
Date Thu, 08 Oct 2009 17:34:30 GMT
Hi Ken,

> The file was truncated because I'd configured Bixo to only fetch
> the first 65K of a file, to avoid problems caused by huge files.
> XML is not fine, from what I've seen - the parser will fail if it runs
> into the end of document before finishing the parse.

Using a StAX parser, I guess it should be possible to parse the XML
incrementally and just stop once the data runs out at 64 K.
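
To make the idea concrete, here is a rough sketch (not tested against Tika
itself) using the JDK's javax.xml.stream API. The class and method names are
made up for illustration. It pulls events until the parser complains and
keeps whatever text it has seen so far:

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Tika.
class PartialXmlText {

    /** Collects character data until the document ends or the parser gives up. */
    static String extractText(InputStream in) {
        StringBuilder text = new StringBuilder();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Fetched content is untrusted, so do not process DTDs or external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
        } catch (XMLStreamException e) {
            // Most likely the input was cut off mid-document: keep what we have.
        }
        return text.toString();
    }
}
```

Text sitting right at the cut-off point may or may not be delivered before
the exception, depending on the StAX implementation, so the tail of the
result should not be relied upon.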

> And binary formats like zip, pdf, etc are definitely not OK with
> truncation.

I'd have to check on PDF, but some data can probably be retrieved
(especially from the "web-optimized" PDFs).

Some zips might be OK: if one manages to get at least one complete zip
entry before hitting the 64 K limit (say with XML-in-zip formats like ODF,
OOXML, EPUB), it should be possible to index the file partially.
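
As a rough illustration, java.util.zip.ZipInputStream reads the entries
sequentially from their local headers and does not need the central
directory at the end of the file, so something along these lines should
work. The class and method names are made up, and this is only a sketch,
not what Tika does:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika.
class PartialZipReader {

    /** Returns the entries that could be read completely, in archive order. */
    static Map<String, byte[]> readCompleteEntries(InputStream in) {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                // Only reached when the whole entry was read without error.
                entries.put(entry.getName(), buf.toByteArray());
            }
        } catch (IOException e) {
            // Truncated archive (EOFException, ZipException, ...):
            // drop the entry in progress and keep the earlier ones.
        }
        return entries;
    }
}
```

For an ODF or EPUB file this would typically yield the small leading
entries (such as "mimetype") and any content that fits before the cut.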

It will get messy, though, if the supporting libraries insist on checking
the completeness of the file...

Just my 2 cents,
