tika-dev mailing list archives

From "mungeol heo (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-1728) Detection is not working properly for detecting HWP 5.0 file
Date Tue, 08 Sep 2015 06:13:45 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734273#comment-14734273 ]

mungeol heo edited comment on TIKA-1728 at 9/8/15 6:13 AM:
-----------------------------------------------------------

I have a question about the new mime-type, which is shown below.

<mime-type type="application/x-hwp-v5">
    <_comment>Hangul Word Processor File v5</_comment>
    <sub-class-of type="application/x-tika-msoffice"/>
</mime-type>

My question is: could this mime type cause a non-HWP, OLE2-based file to be detected as an
HWP 5.0 file?
Wouldn't it be better to add magic, glob, or other tags,
so that the definition is more specific?
For instance,

<mime-type type="application/x-hwp-v5">
    <sub-class-of type="application/x-tika-msoffice"/>
    <_comment>Hangul Word Processor File v5</_comment>
    <magic priority="40">
        <match value="HWP Document File" type="string" offset="0"/>
    </magic>
    <glob pattern="*.hwp"/>
</mime-type>
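
The proposed <match> rule compares bytes at offset 0. A minimal Java sketch of that kind of check, using only the standard library, might look like the following. The class and method names (MagicCheck, classify) are hypothetical and not part of Tika; the eight-byte OLE2 signature is the standard compound-document header, and the string is the one from the proposal above. Since an OLE2 container begins with that signature, it may be worth running something like this against test_5.0.hwp to confirm where the string actually occurs before relying on offset="0".

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: reports what a file starts with.
final class MagicCheck {
    // Standard OLE2 compound-document signature (first 8 bytes of the file).
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // String used in the proposed <match> rule.
    static final byte[] HWP_STRING =
        "HWP Document File".getBytes(StandardCharsets.US_ASCII);

    // True if data begins with the given prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Classifies the leading bytes of a file by the two signatures above.
    static String classify(byte[] head) {
        if (startsWith(head, HWP_STRING)) {
            return "hwp-string-at-offset-0";
        }
        if (startsWith(head, OLE2_MAGIC)) {
            return "ole2-container";
        }
        return "unknown";
    }

    public static void main(String[] args) throws Exception {
        for (String name : args) {
            // Reads the whole file for simplicity; only the first bytes matter.
            byte[] data = Files.readAllBytes(Paths.get(name));
            System.out.println(name + ": " + classify(data));
        }
    }
}
```

It could be run as: java MagicCheck test_3.0.hwp test_5.0.hwp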


was (Author: mungeol):
I have a question about the new mime-type, which is shown below.

<mime-type type="application/x-hwp-v5">
    <_comment>Hangul Word Processor File v5</_comment>
    <sub-class-of type="application/x-tika-msoffice"/>
</mime-type>

My question is: could this mime type cause a non-HWP, OLE2-based file to be detected as an
HWP 5.0 file?
Wouldn't it be better to add magic, glob, or other tags,
so that the definition is more specific?
For instance,

<mime-type type="application/x-hwp-v5">
    <sub-class-of type="application/x-tika-msoffice"/>
    <magic priority="40">
        <match value="HWP Document File" type="string" offset="0"/>
    </magic>
    <glob pattern="*.hwp"/>
</mime-type>

> Detection is not working properly for detecting HWP 5.0 file
> ------------------------------------------------------------
>
>                 Key: TIKA-1728
>                 URL: https://issues.apache.org/jira/browse/TIKA-1728
>             Project: Tika
>          Issue Type: Bug
>         Environment: OS: windows 7 and centos 6
> Java: 1.7
> Tika jar: tika-app-1.10.jar
> File: HWP 5.0
>            Reporter: mungeol heo
>         Attachments: HWP-document-file-formats-3.0-Korean.pdf, HWP-document-file-formats-5.0-Korean.pdf, error-message.png, test_3.0.hwp, test_5.0.hwp
>
>
> The HWP file format has two versions: HWP 3.0 and HWP 5.0.
> 'tika-app-1.10.jar' detects HWP 3.0 files correctly,
> but not HWP 5.0 files.
> The commands used and the results returned are shown below.
> > java -jar tika-app-1.10.jar --detect test_3.0.hwp
> > application/x-hwp
> > java -jar tika-app-1.10.jar --detect test_5.0.hwp
> > application/x-tika-msoffice



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
