tika-dev mailing list archives

From "David vandendriessche (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-100) Structured PDF parsing
Date Fri, 01 Mar 2013 11:05:13 GMT

    [ https://issues.apache.org/jira/browse/TIKA-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13590425#comment-13590425 ]

David vandendriessche commented on TIKA-100:

At the moment I'm using PDFBox directly to upload my data to Solr (a search engine), since Tika
doesn't support page extraction.

I'm pretty sure that if Tika gets this (Solr uses Tika if you use the extract handler), they might
change Solr so it can return page hits for PDFs.

> Structured PDF parsing
> ----------------------
>                 Key: TIKA-100
>                 URL: https://issues.apache.org/jira/browse/TIKA-100
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
> The PDF parser currently extracts and outputs document content as a single string. PDFBox
> could be used to support structuring at least down to page and paragraph (not sure how accurate).

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
