tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-447) Container aware mimetype detection
Date Wed, 04 Aug 2010 09:32:16 GMT

    [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895192#action_12895192 ]

Jukka Zitting commented on TIKA-447:

I committed my patch in revision 982175.

> memory and processing impact of opening the container

I think this is acceptable, as the extra cost is only incurred for specific media types, and
we can use the open container feature you added to TikaInputStream to let later parsing
stages avoid duplicating these costs. Also, since this functionality is now only triggered
when the detector is passed a TikaInputStream, a performance-conscious user can easily prevent
the extra processing. We might also want to add an explicit flag for this if needed.
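
To illustrate the idea above, here is a minimal, self-contained sketch of a detector that only opens the container when it is handed a special stream type, and that caches the opened container on that stream for later stages. It does not use the real Tika classes: ContainerStream is a hypothetical stand-in for TikaInputStream, and the method names, the placeholder container object and the "application/x-example-refined" type are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class ContainerDetectionSketch {

    /** Hypothetical stand-in for TikaInputStream: can carry an already opened container. */
    static class ContainerStream extends ByteArrayInputStream {
        private Object openContainer; // set by the detector, reusable by later parsing stages
        ContainerStream(byte[] data) { super(data); }
        Object getOpenContainer() { return openContainer; }
        void setOpenContainer(Object container) { openContainer = container; }
    }

    /** Counts how often the (simulated) expensive container opening happened. */
    static int containersOpened = 0;

    /** Cheap magic-byte check; leaves the stream position unchanged. */
    static boolean startsWithZipMagic(InputStream in) throws IOException {
        if (!in.markSupported()) {
            return false; // this sketch only handles mark/reset capable streams
        }
        in.mark(4);
        byte[] head = new byte[4];
        int n = in.read(head);
        in.reset();
        return n == 4 && head[0] == 'P' && head[1] == 'K' && head[2] == 3 && head[3] == 4;
    }

    static String detect(InputStream in) {
        try {
            if (!startsWithZipMagic(in)) {
                return "application/octet-stream";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (!(in instanceof ContainerStream)) {
            // Plain stream: the caller did not opt in, so skip the costly container work.
            return "application/zip";
        }
        ContainerStream cs = (ContainerStream) in;
        if (cs.getOpenContainer() == null) {
            containersOpened++;                 // simulated expensive open
            cs.setOpenContainer(new Object());  // placeholder for the opened container
        }
        return "application/x-example-refined"; // placeholder for a more specific type
    }
}
```

Passing a plain ByteArrayInputStream yields the generic zip type without opening anything, while passing a ContainerStream opens the container once and keeps it around, so a second call (or a later parser) does not pay the cost again.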

> detectors run in the right order

This was a part of my thinking behind the proposed getSupportedTypes() method. With that we
could choose to only run these kinds of more complex detectors when simpler detectors have
first identified the basic container format.
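
As a rough sketch of how that ordering could work: a cheap magic-byte check identifies the basic container format first, and each more complex detector is consulted only if it declares that format as supported. The getSupportedTypes() method is only a proposal at this point, so the interface shape below, the helper names and the example media type strings are all assumptions, not existing Tika API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

final class OrderedDetectionSketch {

    /** Hypothetical shape of a complex detector with the proposed method. */
    interface ContainerDetector {
        Set<String> getSupportedTypes(); // basic types this detector can refine
        String detect(byte[] data);      // refined type, or null if it cannot tell
    }

    /** Records which complex detectors actually ran, for demonstration only. */
    static final List<String> invoked = new ArrayList<>();

    /** Simple detector: magic bytes only, no container parsing. */
    static String detectBasic(byte[] data) {
        if (data.length >= 4 && data[0] == 'P' && data[1] == 'K'
                && data[2] == 3 && data[3] == 4) {
            return "application/zip";
        }
        if (data.length >= 4 && (data[0] & 0xFF) == 0xD0 && (data[1] & 0xFF) == 0xCF
                && (data[2] & 0xFF) == 0x11 && (data[3] & 0xFF) == 0xE0) {
            return "application/x-ole-storage"; // placeholder name for OLE2 containers
        }
        return "application/octet-stream";
    }

    /** Builds a dummy complex detector that always answers with the given refined type. */
    static ContainerDetector named(String name, String baseType, String refined) {
        return new ContainerDetector() {
            public Set<String> getSupportedTypes() {
                return Collections.singleton(baseType);
            }
            public String detect(byte[] data) {
                invoked.add(name);
                return refined;
            }
        };
    }

    /** Runs the simple detector first, then only the matching complex detectors. */
    static String detect(byte[] data, List<ContainerDetector> detectors) {
        String basic = detectBasic(data);
        for (ContainerDetector d : detectors) {
            if (d.getSupportedTypes().contains(basic)) {
                String refined = d.detect(data);
                if (refined != null) {
                    return refined;
                }
            }
        }
        return basic;
    }
}
```

With a zip-based and an OLE2-based detector registered, a zip input only ever reaches the zip-based detector, and input that matches neither container format reaches no complex detector at all.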

> Container aware mimetype detection
> ----------------------------------
>                 Key: TIKA-447
>                 URL: https://issues.apache.org/jira/browse/TIKA-447
>             Project: Tika
>          Issue Type: New Feature
>          Components: mime
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>         Attachments: TIKA-447-TikaInputStream.patch, TikaContainerDetection.patch
> As discussed on the dev list, Tika should ideally have a configurable way to process
> container based formats (eg zip files and ole2 files) when trying to detect the correct mime
> type for a document.
> This needs to be configurable, because some people won't want Tika to have to do all
> the work of parsing the whole file when they're not interested in knowing exactly what's in
> Once we have gone to the trouble of opening and parsing the container file, we should
> try to keep the open container around to speed up parsing of the contents

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
