tika-dev mailing list archives

From Max Valjanski <max...@jet.msk.su>
Subject Re: Detecting container formats
Date Thu, 17 Jun 2010 07:44:17 GMT

Nick Burch wrote:
> At the moment, for OLE2 based files (.xls, .ppt, .doc, .msg, .vsd 
> etc), and for ZIP based files (.zip, but also .xlsx, .pptx, .docx, 
> .odf, .odt, .ots, .sxw etc), I don't think the current method works 
> well. AFAICT, we detect the container, then have sub-class matches
> that try to look
> for the appropriate children by hoping we can guess where the 
> definition might hide within the container. However, I think this is 
> too unreliable - for example, with a .doc file, the entry for the Word 
> stream can come anywhere in the list of top level entries, so is very 
> hard to reliably find without properly parsing the OLE2 structure
I tried to do that, but I found that it does not fit into Tika's
architecture. Parsing an OLE container requires reading the whole file,
whereas Tika works with streams, so we can either

1) remove streaming support and work only with files (or save the stream
into a temporary file before processing), or
2) parse the OLE container during mime-type detection and pass it on to
the text extractor (parser)
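
To make option 1 concrete, here is a rough sketch using only the JDK, not
any Tika API. It copies the stream into a temporary file so the container
could then be read with random access, and checks the first eight bytes
against the OLE2 signature. The names SpoolSketch, spoolToTempFile and
looksLikeOle2 are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

public class SpoolSketch {
    // OLE2 compound document signature (first 8 bytes of the file).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Copy the whole stream to a temporary file; the caller deletes it when done.
    public static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".bin");
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    // Checks only the signature. Finding the Word or Excel stream would still
    // need a real OLE2 directory parse over the spooled file.
    public static boolean looksLikeOle2(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = in.readNBytes(OLE2_MAGIC.length);
            return Arrays.equals(head, OLE2_MAGIC);
        }
    }
}
```

The cost is the one already mentioned: every document is written to disk
once, even when the caller only has a stream.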

I do not like the first solution, but the second requires architectural
changes in Tika.
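
To illustrate the kind of change option 2 implies, here is a sketch in
which the detector parses the container once and leaves the result in a
shared context that the parser checks before re-reading anything. This is
not Tika's API: Context, ParsedContainer, detect and parse are all
hypothetical, and the "parsed container" is reduced to a list of top-level
entry names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class HandoffSketch {
    // Hypothetical per-document context shared by detector and parser.
    public static final class Context {
        private final Map<Class<?>, Object> values = new HashMap<>();
        public <T> void put(Class<T> type, T value) { values.put(type, value); }
        public <T> Optional<T> get(Class<T> type) {
            return Optional.ofNullable(type.cast(values.get(type)));
        }
    }

    // Stand-in for a parsed OLE2 container: only its top-level entry names.
    public static final class ParsedContainer {
        public final List<String> entries;
        public ParsedContainer(List<String> entries) {
            this.entries = new ArrayList<>(entries);
        }
    }

    // Detection looks at all entries, in any order, and keeps the parse result.
    public static String detect(List<String> topLevelEntries, Context ctx) {
        ctx.put(ParsedContainer.class, new ParsedContainer(topLevelEntries));
        if (topLevelEntries.contains("WordDocument")) return "application/msword";
        if (topLevelEntries.contains("Workbook")) return "application/vnd.ms-excel";
        return "application/octet-stream"; // illustrative fallback only
    }

    // The parser reuses the container if the detector left one behind.
    public static String parse(Context ctx) {
        return ctx.get(ParsedContainer.class)
                .map(c -> "reused container with " + c.entries.size() + " entries")
                .orElse("no container handed over; would have to re-read the stream");
    }
}
```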

Anyway, I wrote type detection code for OLE in TIKA-437.

best wishes, Max
