tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (TIKA-416) Out-of-process text extraction
Date Tue, 18 Jan 2011 15:36:43 GMT

    [ https://issues.apache.org/jira/browse/TIKA-416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983229#action_12983229 ]

Jukka Zitting edited comment on TIKA-416 at 1/18/11 10:35 AM:
--------------------------------------------------------------

An initial version of this feature is now working and included in the latest trunk.

To illustrate the improvement, here's what I'm seeing with one fairly large
Excel document:

$ java -Xmx32m -jar tika-app-0.9-SNAPSHOT.jar large.xls
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:69)
	at org.apache.poi.poifs.storage.RawDataBlockList.<init>(RawDataBlockList.java:55)
	at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:157)
	at org.apache.tika.detect.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:145)
	at org.apache.tika.detect.POIFSContainerDetector.detect(POIFSContainerDetector.java:96)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:60)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:126)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:94)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:273)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:80)

An OutOfMemoryError is especially troublesome in container environments, where hitting the
memory limit affects all active threads, not just the one using Tika.

With the new out-of-process parsing feature, the parsing work can be pushed into
a separate background process:

$ java -Xmx32m -jar tika-app-0.9-SNAPSHOT.jar --fork large.xls
Exception in thread "main" java.io.IOException: Lost connection to a forked server process
	at org.apache.tika.fork.ForkClient.waitForResponse(ForkClient.java:149)
	at org.apache.tika.fork.ForkClient.call(ForkClient.java:84)
	at org.apache.tika.fork.ForkParser.parse(ForkParser.java:78)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:94)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:273)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:80)

A regular IOException like this is much easier to recover from than an OutOfMemoryError.
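
For applications that embed Tika directly instead of using the command line app, the same
isolation is exposed through the new ForkParser class. Here's a minimal sketch of how I'd
expect it to be used from client code (the wrapper class and file name are only illustrative,
and the exact API in trunk may still evolve):

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class ForkParsingExample {

    public static void main(String[] args) throws Exception {
        // Wrap the normal AutoDetectParser in a ForkParser so the actual
        // parsing happens in a forked JVM rather than in this one.
        ForkParser parser = new ForkParser(
                ForkParsingExample.class.getClassLoader(),
                new AutoDetectParser());
        try {
            InputStream stream = new FileInputStream("large.xls");
            try {
                ContentHandler handler = new BodyContentHandler();
                // If the forked process dies (for example with an
                // OutOfMemoryError), this call fails with an IOException
                // that the application can catch and recover from.
                parser.parse(stream, handler, new Metadata(), new ParseContext());
                System.out.println(handler.toString());
            } finally {
                stream.close();
            }
        } finally {
            // Shut down the forked server process(es) held by the parser.
            parser.close();
        }
    }
}

The key point is that a crash or memory blow-up in the forked process surfaces here as a
plain parse failure in the calling process, which can be handled like any other exception.
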

> Out-of-process text extraction
> ------------------------------
>
>                 Key: TIKA-416
>                 URL: https://issues.apache.org/jira/browse/TIKA-416
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.9
>
>
> There's currently no easy way to guard against JVM crashes or excessive memory or CPU
> use caused by parsing very large, broken or intentionally malicious input documents. To better
> protect against such cases and to generally improve the manageability of resource consumption
> by Tika, it would be great if we had a way to run Tika parsers in separate JVM processes. This
> could be handled either as a separate "Tika parser daemon" or as an explicitly managed pool
> of forked JVMs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

