tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (TIKA-154) Better detection of plain text versus binary formats with a text header
Date Sat, 17 Jan 2009 01:17:59 GMT

     [ https://issues.apache.org/jira/browse/TIKA-154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-154.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.3
         Assignee: Jukka Zitting

In revision 735193 I implemented the plain text detection mechanism described in section 4
of the content type sniffing draft [1] I mentioned earlier on the mailing list.

This seems to work quite well, and finally allows us to detect plain text documents that
come with no file name or type hints. :-)
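
For readers who have not seen the draft, the following is a minimal sketch of the kind of
check it describes, not the code committed in revision 735193. It assumes the draft's rule
that data counts as text when its leading bytes contain no "binary data bytes" (0x00-0x08,
0x0B, 0x0E-0x1A, 0x1C-0x1F). The class name PlainTextSniffer, the method names, and the
512-byte look-ahead are hypothetical choices made for this illustration.

```java
import java.nio.charset.StandardCharsets;

final class PlainTextSniffer {
    // Assumed look-ahead size for this sketch; not taken from the issue.
    static final int LOOKAHEAD = 512;

    // Control bytes that (as assumed here) never occur in plain text.
    // Tab (0x09), LF (0x0A), FF (0x0C), CR (0x0D) and ESC (0x1B) are allowed.
    static boolean isBinaryDataByte(int b) {
        return (b >= 0x00 && b <= 0x08)
                || b == 0x0B
                || (b >= 0x0E && b <= 0x1A)
                || (b >= 0x1C && b <= 0x1F);
    }

    // Returns true when none of the first LOOKAHEAD bytes is a binary data byte.
    // Note that empty input is reported as text by this sketch.
    static boolean looksLikePlainText(byte[] data) {
        int n = Math.min(data.length, LOOKAHEAD);
        for (int i = 0; i < n; i++) {
            if (isBinaryDataByte(data[i] & 0xFF)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] text = "Hello, Tika!\r\n".getBytes(StandardCharsets.US_ASCII);
        byte[] zip = {'P', 'K', 0x03, 0x04, 0x14, 0x00};
        System.out.println(looksLikePlainText(text)); // true
        System.out.println(looksLikePlainText(zip));  // false
    }
}
```

In this sketch a ZIP header is rejected even though it starts with the ASCII letters PK,
because the bytes 0x03 and 0x04 that follow fall in the binary range.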

Resolving as Fixed.

[1] http://webblaze.cs.berkeley.edu/2009/mime-sniff/mime-sniff.txt

> Better detection of plain text versus binary formats with a text header
> -----------------------------------------------------------------------
>
>                 Key: TIKA-154
>                 URL: https://issues.apache.org/jira/browse/TIKA-154
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.3
>
>
> Antoni Mylka noted on the mailing list:
>     Many binary formats begin with magic byte sequences composed of ASCII characters, e.g.
>     ZIP files begin with PK
>     PDF files begin with %PDF-
>     CHM help files begin with ITSF
>     etc.
> Tika should do a better job of detecting such cases.
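
The prefixes quoted above can be checked with a simple lookup before any text heuristic is
applied. This is only a rough sketch under stated assumptions, not how Tika's mime
component is implemented: the class MagicPrefixCheck, its methods, and the table are
hypothetical, and the media type names are the ones commonly used for these formats.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class MagicPrefixCheck {
    // ASCII magic prefix -> media type; illustrative table built from the quoted examples.
    static final Map<String, String> MAGIC = new LinkedHashMap<>();
    static {
        MAGIC.put("PK", "application/zip");
        MAGIC.put("%PDF-", "application/pdf");
        MAGIC.put("ITSF", "application/vnd.ms-htmlhelp");
    }

    // True when data begins with every byte of prefix, in order.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns the media type of the first matching prefix, or null when none matches.
    static String sniffMagic(byte[] data) {
        for (Map.Entry<String, String> e : MAGIC.entrySet()) {
            byte[] magic = e.getKey().getBytes(StandardCharsets.US_ASCII);
            if (startsWith(data, magic)) {
                return e.getValue();
            }
        }
        return null;
    }
}
```

A prefix match alone would also claim a plain text file that happens to start with "PK",
so a lookup like this is best treated as one signal among several rather than a final
answer.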

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

