Hello!
-10.01.-28163 22:59, Nick Burch wrote:
> At the moment, for OLE2 based files (.xls, .ppt, .doc, .msg, .vsd
> etc), and for ZIP based files (.zip, but also .xlsx, .pptx, .docx,
> .odf, .odt, .ots, .sxw etc), I don't think the current method works
> well. AFAICT,
> we detect the container, then have sub-class matches that try to look
> for the appropriate children by hoping we can guess where the
> definition might hide within the container. However, I think this is
> too unreliable - for example, with a .doc file, the entry for the Word
> stream can come anywhere in the list of top level entries, so is very
> hard to reliably find without properly parsing the OLE2 structure
>
I tried to do that, but found that it does not fit into the Tika
architecture. Parsing an OLE container requires reading the whole file,
whereas Tika works with streams, so we can either
1) remove streaming support and work only with files (or save the stream
into a temporary file before processing), or
2) parse the OLE container during MIME type detection and pass the parsed
container on to the text extractor (parser).
I do not like the first solution, but the second requires architecture
changes in Tika.
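To illustrate the temp-file variant of option 1, here is a minimal sketch that
uses only the JDK (Java 11+). It is not the TIKA-437 code and does not use any
Tika API; the names StreamSpooler, spoolToTempFile and looksLikeOle2 are made
up for this example. It copies the stream to a temporary file and then checks
the 8-byte OLE2 signature at the start of that file. A real detector would
still have to parse the container's directory to tell .doc from .xls and so on.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
class StreamSpooler {
    // Signature found in the first 8 bytes of an OLE2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Copies the whole stream into a new temporary file so that it can be
    // read with random access later. The caller must delete the file.
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("ole-spool-", ".tmp");
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    // Returns true if the file starts with the OLE2 signature.
    // This only identifies the container, not the document type inside it.
    static boolean looksLikeOle2(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = in.readNBytes(OLE2_MAGIC.length);
            return Arrays.equals(head, OLE2_MAGIC);
        }
    }
}
```

The obvious cost is the extra copy of every input, which is why I do not like
this option.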
Anyway, I wrote type detection code for OLE in TIKA-437.
Best wishes, Max