tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (TIKA-416) Out-of-process text extraction
Date Tue, 18 Jan 2011 15:34:44 GMT

     https://issues.apache.org/jira/browse/TIKA-416

Jukka Zitting resolved TIKA-416.

       Resolution: Fixed
    Fix Version/s: 0.9
         Assignee: Jukka Zitting

An initial version of this feature is now working and included in the latest trunk.

To illustrate the improvement, here's what I'm seeing with a somewhat large Excel document:

$ java -Xmx32m -jar tika-app-0.9-SNAPSHOT.jar large.xls
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:69)
	at org.apache.poi.poifs.storage.RawDataBlockList.<init>(RawDataBlockList.java:55)
	at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:157)
	at org.apache.tika.detect.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:145)
	at org.apache.tika.detect.POIFSContainerDetector.detect(POIFSContainerDetector.java:96)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:60)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:126)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:94)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:273)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:80)

The OutOfMemoryError is really troublesome in many container environments where hitting the
memory limit affects all active threads, not just the one using Tika.

With the new out-of-process parsing feature, it's possible to externalize this problem into
a separate background process:

$ java -Xmx32m -jar tika-app-0.9-SNAPSHOT.jar --fork large.xls
Exception in thread "main" java.io.IOException: Lost connection to a forked server process
	at org.apache.tika.fork.ForkClient.waitForResponse(ForkClient.java:149)
	at org.apache.tika.fork.ForkClient.call(ForkClient.java:84)
	at org.apache.tika.fork.ForkParser.parse(ForkParser.java:78)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:94)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:273)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:80)

A normal exception like this is much easier to recover from than an OutOfMemoryError.
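For programmatic use, the same mechanism is exposed through the new ForkParser class in org.apache.tika.fork, which wraps another Parser so that the actual parsing runs in a forked JVM. The sketch below is a minimal illustration based on the API currently in trunk; details may still change before the 0.9 release, and the filename is just the example document from above.

```java
import java.io.InputStream;

import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class ForkExample {
    public static void main(String[] args) throws Exception {
        // Wrap AutoDetectParser so parsing happens in a separate,
        // forked JVM; an OutOfMemoryError there cannot take down
        // this process, it just surfaces as a parse failure here.
        ForkParser parser = new ForkParser(
                ForkExample.class.getClassLoader(), new AutoDetectParser());
        try (InputStream stream =
                ForkExample.class.getResourceAsStream("/large.xls")) {
            ContentHandler handler = new BodyContentHandler();
            parser.parse(stream, handler, new Metadata(), new ParseContext());
            System.out.println(handler.toString());
        } finally {
            // Shut down the background server processes.
            parser.close();
        }
    }
}
```

The client and the forked server talk over the child process's stdin/stdout pipes, which is why a crashed or killed server shows up as the IOException seen above rather than as a JVM-wide failure.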

> Out-of-process text extraction
> ------------------------------
>                 Key: TIKA-416
>                 URL: https://issues.apache.org/jira/browse/TIKA-416
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.9
> There's currently no easy way to guard against JVM crashes or excessive memory or CPU
> use caused by parsing very large, broken or intentionally malicious input documents. To better
> protect against such cases and to generally improve the manageability of resource consumption
> by Tika, it would be great if we had a way to run Tika parsers in separate JVM processes. This
> could be handled either as a separate "Tika parser daemon" or as an explicitly managed pool
> of forked JVMs.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
