tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Updated: (TIKA-447) Container aware mimetype detection
Date Mon, 02 Aug 2010 12:31:16 GMT

     [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-447:
-------------------------------

    Attachment: TIKA-447-TikaInputStream.patch

BTW, the new Detector implementations are a bit troublesome, as they break the contract
that the detect() method must not close() the given stream and should use mark() and reset()
where necessary to avoid changing the state of the stream. The rationale behind this contract
is that you should be able to call parse() on the same stream instance after detecting its
type.
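
As a rough illustration of the contract described above, here is a minimal sketch that uses only java.io. It is not Tika's Detector interface: the class name MarkResetDetectSketch, the String return type and the choice of the ZIP magic bytes are assumptions made for the example. The point is only that the method peeks at the stream between mark() and reset() and never closes it.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical sketch class; not part of Tika.
class MarkResetDetectSketch {

    // ZIP local file header magic: 'P' 'K' 0x03 0x04
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /**
     * Peeks at the first bytes of the stream and then restores its position.
     * The stream must support mark(); it is never closed here.
     */
    static String detect(InputStream input) throws IOException {
        if (!input.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        input.mark(ZIP_MAGIC.length);
        try {
            byte[] prefix = new byte[ZIP_MAGIC.length];
            int n = 0;
            // read() may return fewer bytes than requested, so loop.
            while (n < prefix.length) {
                int r = input.read(prefix, n, prefix.length - n);
                if (r == -1) {
                    break;
                }
                n += r;
            }
            if (n == ZIP_MAGIC.length && Arrays.equals(prefix, ZIP_MAGIC)) {
                return "application/zip";
            }
            return "application/octet-stream";
        } finally {
            // Rewind so the caller sees the stream exactly as it was passed in.
            input.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0x50, 0x4B, 0x03, 0x04, 1, 2, 3};
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        System.out.println(detect(in)); // prints "application/zip"
        System.out.println(in.read());  // prints 80 (0x50), the first byte again
    }
}
```

Because the stream is rewound rather than consumed or closed, the caller can hand the same instance to a parser afterwards.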

The attached patch fixes this issue by using the TikaInputStream.getFile() method to access
the underlying file (when available or spooled) when detecting these kinds of complex container
formats. If the given stream is not a TikaInputStream, then just the generic application/zip
or application/x-tika-msoffice type is returned.
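
The fallback behaviour described above can be sketched roughly as follows. This is not the code from the attached patch: FileBackedStream is a hypothetical stand-in for TikaInputStream (only its getFile() idea is borrowed), and checking for a word/document.xml entry is a simplified heuristic chosen for illustration. The sketch inspects the container through the file, so the stream that was passed in is neither read nor closed.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch class; not part of Tika.
class FileBackedDetectSketch {

    // Hypothetical stand-in for TikaInputStream: a stream that knows its file.
    static class FileBackedStream extends FilterInputStream {
        private final File file;

        FileBackedStream(File file) throws IOException {
            super(new FileInputStream(file));
            this.file = file;
        }

        File getFile() {
            return file;
        }
    }

    static final String DOCX =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

    static String detectZip(InputStream input) throws IOException {
        if (!(input instanceof FileBackedStream)) {
            // No file to inspect: fall back to the generic container type.
            return "application/zip";
        }
        File file = ((FileBackedStream) input).getFile();
        // Open the container via the file; the input stream stays untouched.
        try (ZipFile zip = new ZipFile(file)) {
            // Simplified heuristic (an assumption of this sketch).
            if (zip.getEntry("word/document.xml") != null) {
                return DOCX;
            }
            return "application/zip";
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("sketch", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(f))) {
            out.putNextEntry(new ZipEntry("word/document.xml"));
            out.write("<doc/>".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        try (InputStream in = new FileBackedStream(f)) {
            System.out.println(detectZip(in)); // specific type, found via the file
        }
        try (InputStream in = new FileInputStream(f)) {
            System.out.println(detectZip(in)); // generic application/zip
        }
        f.delete();
    }
}
```

The trade-off is the one the message points out: a plain stream only gets the generic type, while a file-backed stream can be inspected in full without disturbing its read position.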

> Container aware mimetype detection
> ----------------------------------
>
>                 Key: TIKA-447
>                 URL: https://issues.apache.org/jira/browse/TIKA-447
>             Project: Tika
>          Issue Type: New Feature
>          Components: mime
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>         Attachments: TIKA-447-TikaInputStream.patch, TikaContainerDetection.patch
>
>
> As discussed on the dev list, Tika should ideally have a configurable way to process
> container based formats (e.g. zip files and ole2 files) when trying to detect the correct
> mime type for a document.
> This needs to be configurable, because some people won't want Tika to have to do all
> the work of parsing the whole file when they're not interested in knowing exactly what's
> in it.
> Once we have gone to the trouble of opening and parsing the container file, we should
> try to keep the open container around to speed up parsing of the contents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

