tika-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1490) Basic parser for old Excel files (eg Excel 4)
Date Mon, 22 Dec 2014 06:43:13 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255491#comment-14255491 ]

Hudson commented on TIKA-1490:

SUCCESS: Integrated in tika-trunk-jdk1.7 #381 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/381/])
TIKA-1490 Use the Old Excel parser for older OLE2 based formats too, like Excel 5 and 95 (nick:
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OldExcelParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java

> Basic parser for old Excel files (eg Excel 4)
> ---------------------------------------------
>                 Key: TIKA-1490
>                 URL: https://issues.apache.org/jira/browse/TIKA-1490
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>             Fix For: 1.7
> In TIKA-1487, we added mime magic for the pre-OLE2 Excel file formats. Based on the reading
> of the OpenOffice Excel docs for that, it looks like it should be possible to produce a basic
> parser to extract key bits of info (e.g. strings) from these older file formats.
> This would likely largely be done by having a custom record iterator for the older formats,
> then passing the handful of "interesting" records to POI's record classes (maybe with some
> tweaks for the older formats) to have the binary data parsed, then returned by the parser.
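
The "custom record iterator" idea in the description above could look roughly like the sketch below. It is only an illustration, not the code committed for TIKA-1490. It assumes the stream is a plain sequence of BIFF-style records, each with a 2-byte little-endian record id, a 2-byte little-endian body length, and then the body. The class name OldExcelRecordSketch, the RawRecord holder and the interestingRecords method are hypothetical names made up for this example, and the set of "interesting" record ids is left to the caller rather than hard-coded.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a record iterator; not Tika's OldExcelParser. */
class OldExcelRecordSketch {

    /** One raw record: its type id and its still-undecoded body. */
    static final class RawRecord {
        final int sid;
        final byte[] data;

        RawRecord(int sid, byte[] data) {
            this.sid = sid;
            this.data = data;
        }
    }

    /**
     * Reads an unsigned 16-bit little-endian value.
     * Returns -1 if the stream ends cleanly before the first byte.
     */
    static int readUShortLE(InputStream in) throws IOException {
        int lo = in.read();
        if (lo < 0) {
            return -1;
        }
        int hi = in.read();
        if (hi < 0) {
            throw new EOFException("Truncated 16-bit value");
        }
        return (hi << 8) | lo;
    }

    /**
     * Walks every record in the stream and keeps only those whose id is in
     * {@code wanted}. Assumed layout: id (2 bytes LE), length (2 bytes LE), body.
     */
    static List<RawRecord> interestingRecords(InputStream in, Set<Integer> wanted)
            throws IOException {
        DataInputStream din = new DataInputStream(in);
        List<RawRecord> kept = new ArrayList<>();
        int sid;
        while ((sid = readUShortLE(din)) >= 0) {
            int length = readUShortLE(din);
            if (length < 0) {
                throw new EOFException("Record header is missing its length");
            }
            byte[] data = new byte[length];
            din.readFully(data); // throws EOFException if the body is cut short
            if (wanted.contains(sid)) {
                kept.add(new RawRecord(sid, data));
            }
        }
        return kept;
    }
}
```

In a real parser the kept record bodies would then be handed to a decoder, such as POI's record classes as the description suggests, instead of being returned as raw bytes. For OLE2-based files the record stream would first have to be pulled out of the container, which this sketch does not attempt.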

This message was sent by Atlassian JIRA
