tika-dev mailing list archives

From "mungeol heo (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Deleted] (TIKA-330) Better HWP (Hangul Word Processor) detection pattern
Date Wed, 02 Sep 2015 06:37:45 GMT

     [ https://issues.apache.org/jira/browse/TIKA-330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

mungeol heo updated TIKA-330:
    Comment: was deleted

(was: HWP files now come in two formats, HWP 3.0 and HWP 5.0.
The signature string starting with "HWP Document File V" can detect only HWP 3.0.
It should be changed to "HWP Document File" so that both versions of the HWP file
format are detected.)

> Better HWP (Hangul Word Processor) detection pattern
> ----------------------------------------------------
>                 Key: TIKA-330
>                 URL: https://issues.apache.org/jira/browse/TIKA-330
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.6
> The current magic byte pattern we have for the HWP (Hangul Word Processor, application/x-hwp)
> file format also matches the test-outlook.msg test file we have. I looked for a better detection
> pattern and found one from OpenOffice.org.
> The hwpfilter/source/hwpfile.cpp file suggests that all HWP files start with the signature
> string "HWP Document File V", so I'll change the detection pattern accordingly.
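
The prefix check described in the issue can be sketched roughly as below. This is only an illustration of matching a file's leading bytes against the signature string quoted above, not Tika's actual detection code. The class name HwpSignatureSketch, the method startsWithSignature, and the sample header bytes are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: not part of Tika.
public class HwpSignatureSketch {
    // Signature string quoted in the issue description.
    static final byte[] SIGNATURE =
            "HWP Document File V".getBytes(StandardCharsets.US_ASCII);

    // Returns true only if the given bytes begin with the signature.
    static boolean startsWithSignature(byte[] header) {
        if (header == null || header.length < SIGNATURE.length) {
            return false;
        }
        byte[] prefix = Arrays.copyOf(header, SIGNATURE.length);
        return Arrays.equals(prefix, SIGNATURE);
    }

    public static void main(String[] args) {
        // Illustrative header that begins with the signature (sample data, assumed).
        byte[] hwpLike = "HWP Document File V3.00"
                .getBytes(StandardCharsets.US_ASCII);
        // Leading bytes of an OLE2 compound file, the container used by Outlook .msg files.
        byte[] ole2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                       (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};

        System.out.println(startsWithSignature(hwpLike));
        System.out.println(startsWithSignature(ole2));
    }
}
```

A check anchored on the full signature string at offset 0 cannot match a file that starts with different bytes, which is the property the description is after when it mentions test-outlook.msg.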

This message was sent by Atlassian JIRA
