tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-447) Container aware mimetype detection
Date Mon, 06 Dec 2010 00:48:11 GMT

    [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967055#action_12967055 ]

Jukka Zitting commented on TIKA-447:

In revision 1042497 I added an auto-loading mechanism for detectors so that tools like the
Tika facade or the AutoDetectParser class can automatically pick up all detector implementations
on the current classpath. This way, the container-aware detectors can also be used with minimal
changes to client code.
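
As a rough illustration of classpath auto-loading in general, the sketch below uses the standard java.util.ServiceLoader. The message does not say which discovery mechanism revision 1042497 actually uses, so treat ServiceLoader as an assumption; SketchDetector and SketchCompositeDetector are hypothetical names and not Tika classes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical stand-in for a detector interface; NOT Tika's actual API.
interface SketchDetector {
    /** Returns a media type name, or null if this detector cannot tell. */
    String detect(InputStream stream) throws IOException;
}

// Asks each known detector in turn and keeps the first non-null answer.
final class SketchCompositeDetector implements SketchDetector {
    static final String FALLBACK = "application/octet-stream";

    private final List<SketchDetector> detectors = new ArrayList<>();

    SketchCompositeDetector(Iterable<SketchDetector> found) {
        for (SketchDetector d : found) {
            detectors.add(d);
        }
    }

    // Assumption: implementations are listed in
    // META-INF/services/<fully qualified interface name> inside jars on the
    // classpath, which is the convention ServiceLoader relies on.
    static SketchCompositeDetector fromClasspath() {
        return new SketchCompositeDetector(ServiceLoader.load(SketchDetector.class));
    }

    @Override
    public String detect(InputStream stream) throws IOException {
        for (SketchDetector d : detectors) {
            String type = d.detect(stream);
            if (type != null) {
                return type;
            }
        }
        return FALLBACK;
    }

    int size() {
        return detectors.size();
    }
}
```

With a layout like this, client code only calls fromClasspath() and gains new detectors whenever a jar that registers one is added to the classpath.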

To prevent excessive performance overhead, both the Zip and POIFS detectors will first check
for the relevant magic byte header and will only do the more expensive format check if the
byte header matches and if the given stream is a TikaInputStream instance.
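
The two-stage check described above could look roughly like the following sketch. It is not the code from the revision: RichStream is a hypothetical stand-in for TikaInputStream, Inspector is a hypothetical hook for the expensive format check, and returning the generic type for a plain stream is an assumption about the fallback behaviour. Only the magic byte values themselves (the Zip local file header and the OLE2 compound document signature) are standard.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class ZipPrefilterSketch {
    // Zip local file header signature: "PK" 0x03 0x04.
    static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    // OLE2 compound document signature (the format POIFS reads).
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Hypothetical stand-in for a stream type that allows container inspection. */
    static class RichStream extends BufferedInputStream {
        RichStream(InputStream in) {
            super(in);
        }
    }

    /** Hypothetical hook for the expensive container format check. */
    interface Inspector {
        String inspect(RichStream in) throws IOException;
    }

    /** Cheap check: peeks at the first bytes and rewinds the stream afterwards. */
    static boolean startsWith(InputStream in, byte[] magic) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(magic.length);
        try {
            byte[] head = new byte[magic.length];
            int n = 0;
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break; // stream is shorter than the magic header
                }
                n += r;
            }
            return n == magic.length && Arrays.equals(head, magic);
        } finally {
            in.reset();
        }
    }

    /** Returns null when the header does not match, so other detectors can try. */
    static String detectZip(InputStream in, Inspector expensive) throws IOException {
        if (!startsWith(in, ZIP_MAGIC)) {
            return null; // cheap rejection, no container parsing
        }
        if (!(in instanceof RichStream)) {
            return "application/zip"; // assumption: generic answer for plain streams
        }
        return expensive.inspect((RichStream) in); // only now pay for the full check
    }
}
```

The ordering keeps the common case cheap: most streams are rejected after reading a handful of bytes, and the container is opened only when both conditions hold.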

In revision 1042498 I added a new --detect option to the CLI for easier testing of the auto-detect
functionality. Also, since the container-aware detectors are now automatically loaded and
used, there's no longer any need for the explicit --container-aware-detector option and I've
turned it into a no-op.

> Container aware mimetype detection
> ----------------------------------
>                 Key: TIKA-447
>                 URL: https://issues.apache.org/jira/browse/TIKA-447
>             Project: Tika
>          Issue Type: New Feature
>          Components: mime
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>         Attachments: TIKA-447-TikaInputStream.patch, TikaContainerDetection.patch
> As discussed on the dev list, Tika should ideally have a configurable way to process
> container based formats (eg zip files and ole2 files) when trying to detect the correct mime
> type for a document.
> This needs to be configurable, because some people won't want Tika to have to do all
> the work of parsing the whole file when they're not interested in knowing exactly what's in
> Once we have gone to the trouble of opening and parsing the container file, we should
> try to keep the open container around to speed up parsing of the contents

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
