tika-dev mailing list archives

From "Michael McCandless (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-612) Specify PDFBox options via ParseContext
Date Tue, 08 Nov 2011 16:55:51 GMT

    [ https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146397#comment-13146397 ]

Michael McCandless commented on TIKA-612:

bq. This would make it easy for client applications to apply also other PDF parsing settings
not currently known by Tika.

+1, this seems like it'd be more general.  E.g., we could fold in get/setSuppressDuplicateOverlappingText
(and move it off of PDFParser), and maybe also get/setEnableAutoSpace.

In general, since there are so many options on PDFTextStripper, and the "right" settings seem
to vary PDF by PDF, it's important that we expose full control...
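
A rough sketch of the idea, in self-contained Java rather than against the real Tika classes: the client puts an options object into a type-keyed context, and the parser looks it up and falls back to defaults when none is present. The Context class here is only a simplified stand-in for ParseContext, and PdfOptions, its field names, its default values, and resolve() are all hypothetical names made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class PdfOptionsSketch {

    /** Simplified stand-in for a type-keyed context such as ParseContext (not the Tika class). */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        /** Returns the object registered under key, or defaultValue if none was set. */
        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** Hypothetical options holder; the defaults below are arbitrary, not Tika's. */
    static final class PdfOptions {
        boolean suppressDuplicateOverlappingText = false;
        boolean enableAutoSpace = true;
    }

    /** What a parser could do: use the client's options if present, else defaults. */
    static PdfOptions resolve(Context context) {
        return context.get(PdfOptions.class, new PdfOptions());
    }

    public static void main(String[] args) {
        // Client side: override one setting for a particular PDF.
        Context context = new Context();
        PdfOptions options = new PdfOptions();
        options.suppressDuplicateOverlappingText = true;
        context.set(PdfOptions.class, options);

        // Parser side: pick up whatever the client supplied.
        PdfOptions effective = resolve(context);
        System.out.println(effective.suppressDuplicateOverlappingText);

        // With an empty context the parser falls back to its own defaults.
        System.out.println(resolve(new Context()).enableAutoSpace);
    }
}
```

The point of keying on the options class is that new settings can be added to the holder later without changing the parser's method signatures.
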
> Specify PDFBox options via ParseContext 
> ----------------------------------------
>                 Key: TIKA-612
>                 URL: https://issues.apache.org/jira/browse/TIKA-612
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>            Priority: Minor
>         Attachments: TIKA-612-testcase.patch, Tika-612.patch, testPDFTwoColumns.pdf
> See https://issues.apache.org/jira/browse/TIKA-611. The options used by PDFBox are currently
> hard-coded in the PDFParser code; we will allow them to be specified via the ParseContext.


