tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1731) Try to integrate java-hwp into Tika
Date Fri, 11 Sep 2015 11:40:46 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740625#comment-14740625 ]

Tim Allison commented on TIKA-1731:

Based on only a very cursory look at the examples+specs you sent, I'd say:
# HWP 3.0 = HWP 3.0 (it appears to be its own binary format; it might be derived from
something, I just don't know)
# HWP 5.0 ~ (is kind of like) .doc ... it uses the same general underlying file structures
as .doc (OLE), but it does some dramatically different things.
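
As a rough way to check the "same underlying file structures as .doc" point, here is a minimal sketch that tests whether a file starts with the 8-byte OLE2 (Compound File Binary) signature. The class and method names (Ole2Probe, startsWithOle2Magic, isOle2) are made up for illustration and are not part of Tika or java-hwp. A match only says the container is OLE2; it says nothing about whether the streams inside are HWP 5.0.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not a Tika or java-hwp class.
public class Ole2Probe {
    // Signature found in the first 8 bytes of an OLE2 compound file (.doc, .xls, ...).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if the given bytes begin with the OLE2 signature.
    public static boolean startsWithOle2Magic(byte[] header) {
        return header.length >= OLE2_MAGIC.length
            && Arrays.equals(Arrays.copyOf(header, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // Reads the first 8 bytes of the file and checks them (requires Java 11+ for readNBytes).
    public static boolean isOle2(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return startsWithOle2Magic(in.readNBytes(OLE2_MAGIC.length));
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java Ole2Probe some-file.hwp
        System.out.println(isOle2(Path.of(args[0])) ? "OLE2 container" : "not OLE2");
    }
}
```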

If you can figure out how to generate the equivalent of a .docx from hwp, it'd be useful to
see if Tika can handle that.

To test equivalence with .docx, change the file suffix to .zip, and unzip it.  If it unzips
and you see a bunch of xml files, we're on the right track...
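
The same check can be done without renaming the file. Below is a minimal sketch, assuming only the JDK's java.util.zip classes, that lists the .xml entries in a stream if it reads as a zip. ZipXmlProbe and xmlEntries are hypothetical names, not Tika API. An empty result means either "not a zip" or "a zip with no xml entries"; a non-empty result is only a hint that the file is in the .docx family, not proof.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not a Tika class.
public class ZipXmlProbe {
    // Returns the names of all entries ending in ".xml"; empty if the stream is not a zip.
    public static List<String> xmlEntries(InputStream in) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // getNextEntry() returns null when no (further) zip entry header is found.
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().toLowerCase(Locale.ROOT).endsWith(".xml")) {
                    names.add(entry.getName());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ZipXmlProbe converted-from-hwp.docx
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            xmlEntries(in).forEach(System.out::println);
        }
    }
}
```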

> Try to integrate java-hwp into Tika
> -----------------------------------
>                 Key: TIKA-1731
>                 URL: https://issues.apache.org/jira/browse/TIKA-1731
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Tim Allison
>            Priority: Minor
> Now that we have detection working for hwp files, it would be great to add a parser.
> [java-hwp|https://github.com/ddoleye/java-hwp] looks like a promising candidate.  We'd
> need to ask ddoleye about a potential change in license and then interest in maintenance +
> pushing to maven.
> Any other candidates?

