tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-447) Container aware mimetype detection
Date Mon, 02 Aug 2010 09:45:15 GMT

    [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894510#action_12894510 ]

Nick Burch commented on TIKA-447:
---------------------------------

Alex - have a look at the code, I think it already does what you're asking of it :)

For OLE2, when we detect the OLE2 signature, we load the file into POIFS. We then ask the
detector what the file is, based on what POIFS has loaded.
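
As a rough illustration of the first step only (spotting the OLE2 signature before handing
the file to POIFS), here is a minimal sketch in plain Java. The class and method names are
hypothetical and are not taken from the attached patch; the only fixed fact used is the
8-byte OLE2 magic number. Loading into POIFS needs Apache POI and is not shown.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not part of Tika or POI: checks only the OLE2 magic number.
public class Ole2SignatureCheck {
    // The 8-byte signature at the start of an OLE2 compound document.
    private static final byte[] OLE2_SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    /**
     * Returns true if the stream starts with the OLE2 signature.
     * The stream is reset afterwards so a later step can re-read it from the start.
     */
    public static boolean looksLikeOle2(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(OLE2_SIGNATURE.length);
        try {
            byte[] header = new byte[OLE2_SIGNATURE.length];
            int read = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    break;
                }
                read += n;
            }
            return read == header.length && Arrays.equals(header, OLE2_SIGNATURE);
        } finally {
            in.reset();
        }
    }
}
```

Only when this check passes would it be worth paying the cost of loading the whole file
into POIFS.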

For Zip, we look at each entry in the zip file in turn. If we recognise the entry's name,
and the name alone tells us all we need, we return. Otherwise, we open up that entry, grab
the mime type from its contents, and return.
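
The zip walk described above could be sketched roughly as follows, using only
java.util.zip. This is an illustration under assumptions, not the code in the patch: the
class name, the choice of recognised entry names, and the fallback to application/zip are
all hypothetical. It relies on two well-known conventions: OpenDocument and EPUB files
carry their media type in an entry called "mimetype", and a .docx file normally contains
"word/document.xml".

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical sketch of container-aware zip detection; not the Tika implementation.
public class ZipContainerSketch {

    /** Walks the zip entries in order and returns a best-guess media type. Closes the stream. */
    public static String detect(InputStream in) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Case 1: the entry name alone is enough (illustrative choice of name).
                if (name.equals("word/document.xml")) {
                    return "application/vnd.openxmlformats-officedocument"
                            + ".wordprocessingml.document";
                }
                // Case 2: the entry's contents state the type (ODF / EPUB convention).
                if (name.equals("mimetype")) {
                    return readCurrentEntry(zip).trim();
                }
            }
        }
        // Assumed fallback: nothing recognised, so report a generic zip.
        return "application/zip";
    }

    // Reads the remainder of the current zip entry as ASCII text.
    private static String readCurrentEntry(ZipInputStream zip) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int n;
        while ((n = zip.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }
}
```

Because the loop returns at the first recognised entry, the rest of the archive is never
read, which keeps the cost down for callers who only want the type.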

> Container aware mimetype detection
> ----------------------------------
>
>                 Key: TIKA-447
>                 URL: https://issues.apache.org/jira/browse/TIKA-447
>             Project: Tika
>          Issue Type: New Feature
>          Components: mime
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>         Attachments: TikaContainerDetection.patch
>
>
> As discussed on the dev list, Tika should ideally have a configurable way to process
> container based formats (eg zip files and ole2 files) when trying to detect the correct mime
> type for a document.
> This needs to be configurable, because some people won't want Tika to have to do all
> the work of parsing the whole file when they're not interested in knowing exactly what's in
> it.
> Once we have gone to the trouble of opening and parsing the container file, we should
> try to keep the open container around to speed up parsing of the contents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

