tika-dev mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject Re: [bulk] Info from parser on handling partial input
Date Fri, 09 Oct 2009 09:06:28 GMT

On Thu, Oct 8, 2009 at 7:34 PM, Hanssens Bart <Bart.Hanssens@fedict.be> wrote:
> Some zips might be OK: if one manages to get at least one zipentry
> before hitting the 64 K limit (say xml-in-zip formats like ODF, OOXML,
> ePUB), it should be possible to index it partially.

It's possible to read ZIP files in streaming mode, but see the caveats
listed in [1]. The current ZipParser in Tika does use streaming mode,
even though the result may be incorrect.
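
As a rough illustration of partial indexing in streaming mode, here is
a minimal sketch that uses the JDK's java.util.zip.ZipInputStream
rather than Tika's ZipParser or Commons Compress. The class and method
names (StreamingZipSketch, completeEntries, buildSampleZip) are
hypothetical, and the 1000-byte cut-off in main merely stands in for
the 64 K limit discussed above. It reports only the entries that could
be read in full from a truncated prefix of the archive.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch, not Tika code.
class StreamingZipSketch {

    /** Names of the entries readable in full from the first {@code limit} bytes. */
    static List<String> completeEntries(byte[] data, int limit) {
        List<String> names = new ArrayList<>();
        int length = Math.max(0, Math.min(limit, data.length));
        byte[] buffer = new byte[4096];
        try (ZipInputStream zin =
                new ZipInputStream(new ByteArrayInputStream(data, 0, length))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                while (zin.read(buffer) != -1) {
                    // Drain the entry; a real parser would hand these bytes on.
                }
                names.add(entry.getName());
            }
        } catch (IOException e) {
            // Truncated or damaged input: keep what was read completely so far.
        }
        return names;
    }

    /** Builds a small in-memory archive: one tiny entry, one incompressible entry. */
    static byte[] buildSampleZip() {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bos)) {
            zout.putNextEntry(new ZipEntry("a.txt"));
            zout.write("hello".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();

            byte[] noise = new byte[10000];
            new Random(42).nextBytes(noise); // random bytes barely compress
            zout.putNextEntry(new ZipEntry("b.bin"));
            zout.write(noise);
            zout.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] zip = buildSampleZip();
        System.out.println(completeEntries(zip, zip.length)); // whole archive
        System.out.println(completeEntries(zip, 1000));       // truncated prefix
    }
}
```

Note that the streaming reader never sees the central directory at the
end of the archive, which is one reason the caveats in [1] apply.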

Once TIKA-153 is solved, we should be able to automatically switch to
more correct parsing when the full input document is available in
random-access mode.

[1] http://commons.apache.org/compress/zip.html


Jukka Zitting
