tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1190) ZipContainerDetector.detect() can spool the entire stream to a temporary file
Date Fri, 01 Nov 2013 14:55:18 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811304#comment-13811304 ]

Nick Burch commented on TIKA-1190:

bq. Doing so would also drop the advanced type header detection by commons-compress. That
detection code doesn't need the whole file, but is also too complex to express in the MIME
magic database.

Isn't the right fix then to pull that part of the detector out into a new one? That would allow
people to exclude the "needs full files" detectors like POIFS, Zip, Vorbis etc., while still
keeping the "needs the first bit of the file" compress detection.
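
To illustrate what a "needs the first bit of the file" detector looks like in general, here is a minimal sketch using only the JDK. It is not the commons-compress logic discussed above (which is more involved); the class and method names are made up for illustration, and it only checks the common ZIP local file header signature. The point is that it reads a bounded prefix and rewinds, so nothing gets spooled.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical sketch, not part of Tika or commons-compress.
final class PrefixDetectionSketch {

    // Local file header signature found at the start of most ZIP archives.
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    /**
     * Returns true if the stream starts with the ZIP signature.
     * Reads at most ZIP_MAGIC.length bytes, then rewinds the stream.
     */
    static boolean startsWithZipMagic(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(ZIP_MAGIC.length);
        try {
            byte[] head = new byte[ZIP_MAGIC.length];
            int filled = 0;
            while (filled < head.length) {
                int read = in.read(head, filled, head.length - filled);
                if (read < 0) {
                    break; // stream ended before a full prefix was available
                }
                filled += read;
            }
            return filled == head.length && Arrays.equals(head, ZIP_MAGIC);
        } finally {
            in.reset(); // leave the stream where the caller handed it to us
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream zipLike = new ByteArrayInputStream(new byte[] {'P', 'K', 3, 4, 20, 0});
        InputStream plain = new ByteArrayInputStream(new byte[] {'h', 'e', 'l', 'l', 'o'});
        System.out.println("zip-like: " + startsWithZipMagic(zipLike));
        System.out.println("plain:    " + startsWithZipMagic(plain));
    }
}
```

A detector of this shape can stay enabled even when the "needs full files" ones are excluded, since its cost is fixed regardless of the document size.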

bq. IMHO better to explain that advanced type detection is possible when the document is available
as a random-access file wrapped to a TikaInputStream

That doesn't feel right to me, especially as some detectors may be able to work with just
a stream. I'd much rather we say Tika will do its best unless you explicitly tell it otherwise.
Remember back a few years to all the queries on-list and in JIRA about incorrect detection
for these container formats. My belief is that most people asking for detection want the best
answer available. Those with special requirements (e.g. quickest close-enough in your case)
should, I believe, be explicitly asking for that, rather than us changing the default for
most people (including those new to Tika, who'll be confused).

> ZipContainerDetector.detect() can spool the entire stream to a temporary file
> -----------------------------------------------------------------------------
>                 Key: TIKA-1190
>                 URL: https://issues.apache.org/jira/browse/TIKA-1190
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>
> As noted in a TODO comment, currently the {{ZipContainerDetector}} calls {{getFile()}}
> on a given {{TikaInputStream}} instance (that looks like a ZIP archive) without using the
> {{hasFile()}} method to check whether a backing file is actually available.
> This is troublesome as it can lead to unexpected performance loss due to the entire stream
> getting spooled to a temporary file that might not be needed at all after the detection.
> A better approach would be to only do the more detailed "full file" format detection
> if the backing file is already available, i.e. if {{hasFile()}} returns true.
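
The guard proposed in the issue description might look roughly like the sketch below. To keep it compilable without Tika on the classpath, {{FileBacked}} is a hypothetical stand-in for the two {{TikaInputStream}} methods named above ({{hasFile()}} and {{getFile()}}); the {{Detail}} enum and the {{detect}} method are likewise invented for illustration and are not the actual {{ZipContainerDetector}} code.

```java
import java.io.File;

// Hypothetical sketch of the hasFile() guard; not the real ZipContainerDetector.
final class HasFileGuardSketch {

    /** Stand-in for the two TikaInputStream methods named in the issue. */
    interface FileBacked {
        boolean hasFile(); // true if a backing file already exists
        File getFile();    // may spool the entire stream to a temporary file
    }

    /** How much detection work was done; illustrative only. */
    enum Detail { PREFIX_ONLY, FULL_FILE }

    static Detail detect(FileBacked stream) {
        if (stream.hasFile()) {
            // The file is already there, so asking for it should not trigger spooling.
            File file = stream.getFile();
            // ... a real detector would open 'file' as a container here ...
            return Detail.FULL_FILE;
        }
        // No backing file: settle for the cheaper answer instead of forcing a spool.
        return Detail.PREFIX_ONLY;
    }

    public static void main(String[] args) {
        FileBacked streamOnly = new FileBacked() {
            public boolean hasFile() { return false; }
            public File getFile() { throw new IllegalStateException("would spool"); }
        };
        System.out.println(detect(streamOnly));
    }
}
```

Whether this guard should be the default is exactly what the comment above argues about; the sketch only shows the mechanics of the check.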

This message was sent by Atlassian JIRA
