tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1190) ZipContainerDetector.detect() can spool the entire stream to a temporary file
Date Fri, 01 Nov 2013 16:08:17 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811392#comment-13811392 ]

Jukka Zitting commented on TIKA-1190:
-------------------------------------

bq. Isn't the right fix then to pull out that part of the detector to a new one?

Right, we could do that.

The reason I'm hesitant about that approach is that I've always thought of the Detector
mechanism as guaranteed to be an {{O(1)}} operation, i.e. independent of the size of the
input document (that was one of my original design goals for the interface). The current
behavior makes it a potentially {{O(n)}} operation, which was quite surprising, at least to
me, in a case where we were using Tika.detect() on a large ZIP archive in transit over the
network.
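
To illustrate the {{O(1)}} property described above, here is a minimal sketch of a detection step that reads only a fixed-size prefix and then rewinds the stream. It is not Tika code: {{PrefixSniffer}} and {{looksLikeZip}} are hypothetical names, and the check only recognises the local file header signature that a non-empty ZIP archive typically starts with.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper (not part of Tika): inspects a fixed-size prefix only. */
final class PrefixSniffer {
    // Local file header signature "PK\003\004" that typically starts a non-empty ZIP archive.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /**
     * Reads at most ZIP_MAGIC.length bytes and resets the stream afterwards,
     * so the cost does not depend on the size of the document.
     */
    static boolean looksLikeZip(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(ZIP_MAGIC.length);
        try {
            byte[] prefix = new byte[ZIP_MAGIC.length];
            int total = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (total < prefix.length) {
                int n = in.read(prefix, total, prefix.length - total);
                if (n == -1) {
                    break;
                }
                total += n;
            }
            return total == ZIP_MAGIC.length && Arrays.equals(prefix, ZIP_MAGIC);
        } finally {
            // Rewind so the caller sees the stream exactly as it was handed in.
            in.reset();
        }
    }
}
```

A check like this can only say "looks like a ZIP archive". Telling apart the formats packaged inside a ZIP container needs the archive contents, which is where the spooling cost discussed in this issue comes from.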

> ZipContainerDetector.detect() can spool the entire stream to a temporary file
> -----------------------------------------------------------------------------
>
>                 Key: TIKA-1190
>                 URL: https://issues.apache.org/jira/browse/TIKA-1190
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>
> As noted in a TODO comment, currently the {{ZipContainerDetector}} calls {{getFile()}} on a given {{TikaInputStream}} instance (that looks like a ZIP archive) without using the {{hasFile()}} method to check whether a backing file is actually available.
> This is troublesome as it can lead to unexpected performance loss due to the entire stream getting spooled to a temporary file that might not be needed at all after the detection.
> A better approach would be to only do the more detailed "full file" format detection if the backing file is already available, i.e. if {{hasFile()}} returns true.
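
The guard proposed in the description above could look roughly like the following sketch. It does not use the Tika classes: {{FileBackedStream}} is a hypothetical stand-in for the two {{TikaInputStream}} methods named in the issue, and {{DetailedDetector}} and {{GuardedZipDetection}} are likewise invented for illustration. The fallback to the generic {{application/zip}} type is an assumption about what a detector would report when no backing file exists.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical stand-in for the hasFile()/getFile() pair named in the issue. */
interface FileBackedStream {
    boolean hasFile();

    // In the scenario from the issue, this call may spool the whole stream to disk.
    File getFile() throws IOException;
}

/** Hypothetical placeholder for the detailed "full file" container inspection. */
interface DetailedDetector {
    String detect(File file) throws IOException;
}

final class GuardedZipDetection {
    // Assumed fallback when only the stream prefix has been inspected.
    static final String GENERIC_ZIP = "application/zip";

    static String detect(FileBackedStream stream, DetailedDetector detailed)
            throws IOException {
        if (stream.hasFile()) {
            // A backing file already exists, so detailed detection adds no spooling cost.
            return detailed.detect(stream.getFile());
        }
        // No backing file: never call getFile(), report the generic type instead.
        return GENERIC_ZIP;
    }
}
```

With this shape, a stream arriving over the network is never written to a temporary file just for detection, at the price of a less specific result in that case.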



--
This message was sent by Atlassian JIRA
(v6.1#6144)
