tika-dev mailing list archives

From "Paul Pearcy (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-818) Allow PDFBox to be used with RandomAccessFile vs RandomAccessBuffer to allow for a memory vs performance tradeoff
Date Mon, 23 Jan 2012 06:31:40 GMT

    [ https://issues.apache.org/jira/browse/TIKA-818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190895#comment-13190895 ]

Paul Pearcy commented on TIKA-818:
----------------------------------

Hey Nick, 
  Thanks a ton for taking a look! Apologies for the delay in response. 

What determines whether PDFBox buffers in memory or in a temporary file is the RandomAccess
implementation passed to the load method:
http://www.jarvana.com/jarvana/view/org/apache/pdfbox/pdfbox/1.6.0/pdfbox-1.6.0-javadoc.jar!/org/apache/pdfbox/pdmodel/PDDocument.html#load(java.io.InputStream, org.apache.pdfbox.io.RandomAccess, boolean)

Here is a sample I've been hacking around with:
https://gist.github.com/1661161

The code probably isn't the best way to set things up for a couple of reasons:
- It'd be nice to allow callers to pick memory or file buffers. Not sure what the correct
approach would be to keep the Tika interface clean.
- I think TikaInputStream has its own temporary file resource management that should probably
be used. Haven't figured that out yet. 
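
To make the first point concrete, here is a rough sketch of how a caller preference could be
combined with the stream-type heuristic from the issue description. Everything in it is
hypothetical: ScratchChoiceSketch, Scratch and Preference are illustrative names, not Tika
or PDFBox API, and the boolean stands in for "the TikaInputStream is backed by a file". The
two Scratch values stand in for PDFBox's RandomAccessBuffer and RandomAccessFile.

```java
// Hypothetical sketch only: none of these names exist in Tika or PDFBox.
final class ScratchChoiceSketch {

    // Stand-ins for PDFBox's RandomAccessBuffer (heap) and RandomAccessFile (disk).
    enum Scratch { MEMORY_BUFFER, TEMP_FILE }

    // Optional caller override; AUTO defers to the stream-type heuristic.
    enum Preference { AUTO, FORCE_MEMORY, FORCE_FILE }

    // Decide which scratch storage to hand to the PDF loader.
    static Scratch choose(boolean streamIsFileBacked, Preference preference) {
        if (preference == Preference.FORCE_MEMORY) {
            return Scratch.MEMORY_BUFFER;
        }
        if (preference == Preference.FORCE_FILE) {
            return Scratch.TEMP_FILE;
        }
        // Assumption: a file-backed stream may be large, so trade speed for
        // heap and use a temporary file; otherwise stay in memory.
        return streamIsFileBacked ? Scratch.TEMP_FILE : Scratch.MEMORY_BUFFER;
    }
}
```

With AUTO as the default, existing callers would not need to change, and anyone parsing many
PDFs in parallel could force the file-backed option regardless of stream type.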

Thanks and Best Regards,
Paul
                
> Allow PDFBox to be used with RandomAccessFile vs RandomAccessBuffer to allow for a memory vs performance tradeoff
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-818
>                 URL: https://issues.apache.org/jira/browse/TIKA-818
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.10, 1.0
>            Reporter: Paul Pearcy
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> After upgrading to Tika 0.10, I began hitting OOM errors when processing large numbers of PDFs
> in parallel. The heap dump indicated that all the memory was being used up by PDFBox
> RandomAccessBuffers. After digging around, it looks like PDFBox now defaults to using RAM
> rather than temporary files for PDF extraction. This can be overridden to use RandomAccessFiles.
> I propose that Tika choose file vs buffer based on the input stream type received. If
> the TikaInputStream is backed by a file, RandomAccessFile should be used; for other stream
> types, RandomAccessBuffer can be used.
> I believe the code to control this is here:
> https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
> At ~line 87:
> PDDocument pdfDocument =
>             PDDocument.load(new CloseShieldInputStream(stream), true);
> Not sure if this is the best approach and am curious if there are other ideas on how
> to control this and keep the interface clean.


        
