tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1033) Tika doesn't parse embedded OLE Chart/Graph objects
Date Mon, 04 Apr 2016 19:45:25 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15224907#comment-15224907 ]

Tim Allison commented on TIKA-1033:

I re-discovered this roughly a year ago on TIKA-1651.  :)  The number of parse exceptions on embedded
xls files was far higher than on standalone xls files...and then I discovered that they
weren't really xls at all.

On POI bug 54213, Yegor (who actually knows what he's talking about) confirmed my suspicions after
looking into the file format with POI.  This is a very different type of file from XLS.  I
did some exploratory hacking to modify the read lengths in POI and could see some data,
but it looks like it'll take a fair amount of effort to add parsing for this without breaking
XLS parsing.

As a first step, we could follow Yegor's [recommendation|https://bz.apache.org/bugzilla/show_bug.cgi?id=54213#c4]
and add detection at least via inspection of the container.  What mime type do we want to
use?  {{application/ms-chart}}?
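
To make the detection idea concrete, here is a minimal sketch of one way it might look, keyed on the progID mentioned in the issue description (MSGraph.Chart.8). The class name, the helper method, and the prefix match are all hypothetical and are not part of Tika or POI; the media type is only the one floated above, not a registered or agreed-upon type. Real detection would more likely inspect the OLE container itself, as Yegor suggested.

```java
import java.util.Locale;

// Hypothetical sketch only: none of these names exist in Tika or POI.
public class ChartProgIdSketch {

    // Media type proposed in this thread; an assumption, not a registered type.
    static final String PROPOSED_CHART_TYPE = "application/ms-chart";

    // The type these embedded objects are currently detected as.
    static final String EXCEL_TYPE = "application/vnd.ms-excel";

    /**
     * Returns the proposed chart type when the progID looks like an
     * MS Graph chart; otherwise keeps whatever type was already detected.
     */
    static String guessMediaType(String progId, String detectedType) {
        if (progId != null
                && progId.toLowerCase(Locale.ROOT).startsWith("msgraph.chart")) {
            return PROPOSED_CHART_TYPE;
        }
        return detectedType;
    }

    public static void main(String[] args) {
        // progID reported for the chart embedded in emb.ppt
        System.out.println(guessMediaType("MSGraph.Chart.8", EXCEL_TYPE));
        // an ordinary embedded spreadsheet keeps its detected type
        System.out.println(guessMediaType("Excel.Sheet.8", EXCEL_TYPE));
    }
}
```

With something like this in place, the chart would no longer be routed to the Excel parser, so the RecordFormatException would go away even before any real chart parsing exists.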

> Tika doesn't parse embedded OLE Chart/Graph objects
> ---------------------------------------------------
>                 Key: TIKA-1033
>                 URL: https://issues.apache.org/jira/browse/TIKA-1033
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: emb.ppt
> I have an example ppt that embeds a chart, but Tika mis-identifies it
> as an XLS document.
> The progID (oleShape.getProgID() in
> HSLFExtractor.handleSlideEmbeddedResources) is MSGraph.Chart.8 ... and
> we seem to detect it as Excel (application/vnd.ms-excel) but then the
> ExcelExtractor hits this exception:
> {noformat}
> org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
> 	at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
> 	at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
> 	at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
> 	at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
> 	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
> 	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
> 	at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:302)
> 	at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:147)
> {noformat}
> Since DelegatingParser silently suppresses all exceptions, when you
> run TikaCLI you won't see any exception nor text extracted, but if you
> run with -z, it will save 1.xls which if you then try to parse with
> TikaCLI hits the above exception.
