tika-dev mailing list archives

From Thorsten Schäfer (JIRA) <j...@apache.org>
Subject [jira] [Commented] (TIKA-1631) OutOfMemoryException in ZipContainerDetector
Date Thu, 13 Apr 2017 14:35:41 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967654#comment-15967654 ]

Thorsten Schäfer commented on TIKA-1631:


Unfortunately, we are also running into this bug with Tika 1.13. Do you have any plans to fix it?

Currently the {{ZipContainerDetector}} instantiates a CompressorInputStream for the file and
uses {{CompressorParser.getMediaType}} to do an instanceof check (code snippet below). The
bug could be circumvented by not creating the stream in the first place and instead moving the
signature checks from {{CompressorStreamFactory#createCompressorInputStream}} to
{{ZipContainerDetector#detectCompressorFormat}} (a sketch of the proposed state follows the snippet).

Current state:
private static MediaType detectCompressorFormat(byte[] prefix, int length) {
    try {
        CompressorStreamFactory factory = new CompressorStreamFactory();
        CompressorInputStream cis = factory.createCompressorInputStream(
                new ByteArrayInputStream(prefix, 0, length));
        try {
            return CompressorParser.getMediaType(cis);
        } finally {
            // close the stream that was opened only for detection
            IOUtils.closeQuietly(cis);
        }
    } catch (CompressorException e) {
        return MediaType.OCTET_STREAM;
    }
}
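
A sketch of what the proposed state could look like, using the static {{matches()}}
signature checks that the Commons Compress stream classes expose; no stream (and no
LZW table) is ever allocated. The media type strings below are assumptions mirroring
{{CompressorParser}}'s mapping, so treat this as an illustration rather than a patch:

Proposed state (sketch):
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.commons.compress.compressors.xz.XZCompressorInputStream;
import org.apache.commons.compress.compressors.z.ZCompressorInputStream;
import org.apache.tika.mime.MediaType;

private static MediaType detectCompressorFormat(byte[] prefix, int length) {
    // Only inspect the magic bytes; never construct a decompressor.
    if (GzipCompressorInputStream.matches(prefix, length)) {
        return MediaType.application("gzip");
    }
    if (BZip2CompressorInputStream.matches(prefix, length)) {
        return MediaType.application("x-bzip2");
    }
    if (XZCompressorInputStream.matches(prefix, length)) {
        return MediaType.application("x-xz");
    }
    if (ZCompressorInputStream.matches(prefix, length)) {
        return MediaType.application("x-compress");
    }
    return MediaType.OCTET_STREAM;
}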

> OutOfMemoryException in ZipContainerDetector
> --------------------------------------------
>                 Key: TIKA-1631
>                 URL: https://issues.apache.org/jira/browse/TIKA-1631
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.8
>            Reporter: Pavel Micka
> When I try to detect a ZIP container, I rarely get this exception. It is caused by the fact
that the file looks like a ZIP container (the magic bytes match) but is in fact random noise.
Apache Commons Compress then tries to read the size of its decoding tables (it expects a
correct stream), coincidentally reads a huge number (since at that position the stream can
contain anything), and tries to allocate an array several GB in size (hence the exception).
> This bug negatively affects the stability of systems running Tika, as the decompressor can
accidentally allocate all available memory, and other parts of the system may then be unable
to allocate their objects.
> A solution might be to add a parameter to the Tika config that limits the size of these
arrays; if the requested size exceeded the limit, an exception would be thrown. This change
should not be hard, as the method InternalLZWInputStream.initializeTables() is protected
(a sketch of this idea follows the stack trace).
> Exception in thread "pool-2-thread-2" java.lang.OutOfMemoryError: Java heap space
> 	at org.apache.commons.compress.compressors.z._internal_.InternalLZWInputStream.initializeTables(InternalLZWInputStream.java:111)
> 	at org.apache.commons.compress.compressors.z.ZCompressorInputStream.<init>(ZCompressorInputStream.java:52)
> 	at org.apache.commons.compress.compressors.CompressorStreamFactory.createCompressorInputStream(CompressorStreamFactory.java:186)
> 	at org.apache.tika.parser.pkg.ZipContainerDetector.detectCompressorFormat(ZipContainerDetector.java:106)
> 	at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:92)
> 	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
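
A minimal sketch of the size-limit idea described in the report, assuming a
CompressorStreamFactory constructor that takes a memory limit in KB as its second
argument; that constructor, the 1024 KB threshold, and the method name
detectWithMemoryLimit are assumptions for illustration, not the shipped fix:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import org.apache.commons.compress.compressors.CompressorException;
import org.apache.commons.compress.compressors.CompressorInputStream;
import org.apache.commons.compress.compressors.CompressorStreamFactory;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.pkg.CompressorParser;

private static MediaType detectWithMemoryLimit(byte[] prefix, int length) {
    // Assumed API: CompressorStreamFactory(decompressConcatenated, memoryLimitInKb).
    // With a ~1 MB cap, a bogus LZW table size fails with an exception
    // (wrapped in CompressorException) instead of exhausting the heap.
    CompressorStreamFactory factory = new CompressorStreamFactory(false, 1024);
    try (CompressorInputStream cis = factory.createCompressorInputStream(
            new ByteArrayInputStream(prefix, 0, length))) {
        return CompressorParser.getMediaType(cis);
    } catch (IOException | CompressorException e) {
        // Over-limit allocations and malformed streams both land here.
        return MediaType.OCTET_STREAM;
    }
}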

This message was sent by Atlassian JIRA
