tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
Date Thu, 06 Feb 2014 15:18:10 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893426#comment-13893426 ]

Tim Allison commented on TIKA-1232:

[~anjackson], yes, I'd like to add your code if others agree that it would be useful.  No need
for a formal patch.  I'll take your GitHub code nearly directly.

Two items:
  1) Would you be interested in contributing your extension-level extraction code to PDFBox
if it doesn't currently exist there?  (I haven't checked, but I assume you wouldn't reinvent
the wheel.)  I think that code would be more at home within PDFBox.
  2) How much testing have you done for potential exceptions thrown by PDFBox on PDFs in the
wild when grabbing this new metadata (cf. the null pointer checks around date parsing in the
current metadata code, and TIKA-1226, TIKA-1232, TIKA-1233)?
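
To make item 2 concrete, here is a minimal sketch of the kind of guard I have in mind.  It is
not taken from the patch, from Tika, or from PDFBox: the VersionSource interface stands in
for whatever PDFBox call actually reads the version, a plain Map stands in for Tika's Metadata
object, and the "pdf:version" key is a made-up placeholder, since the key name is still an
open question on this issue.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class SafePdfVersion {

    /** Hypothetical stand-in for the library call that reads the PDF version. */
    public interface VersionSource {
        float readVersion() throws Exception;
    }

    /** Hypothetical metadata key; the real key name has not been decided. */
    public static final String VERSION_KEY = "pdf:version";

    /**
     * Tries to record the PDF version. Returns true if a value was stored,
     * false if the version was unavailable, implausible, or reading it threw.
     */
    public static boolean addVersion(Map<String, String> metadata, VersionSource source) {
        if (metadata == null || source == null) {
            return false;
        }
        try {
            float version = source.readVersion();
            if (Float.isNaN(version) || version <= 0f) {
                return false; // treat nonsense values as "not available"
            }
            metadata.put(VERSION_KEY, String.format(Locale.ROOT, "%.1f", version));
            return true;
        } catch (Exception e) {
            // A malformed file should not abort the whole parse just because
            // an optional metadata field could not be read.
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        addVersion(metadata, () -> 1.4f);
        // Simulates a broken file: the failure is swallowed and the value is kept.
        addVersion(metadata, () -> { throw new NullPointerException("broken header"); });
        System.out.println(metadata);
    }
}
```

The point is only that each optional field gets its own try/catch, so one bad value in a PDF
from the wild costs us that field and nothing else.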

Thank you, again.

> Add PDF version to PDFParser output
> -----------------------------------
>                 Key: TIKA-1232
>                 URL: https://issues.apache.org/jira/browse/TIKA-1232
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.5
>         Environment: JDK6
>            Reporter: William Palmer
>            Assignee: Tim Allison
>            Priority: Minor
>         Attachments: pdfversion.patch
> I'd like to identify the PDF version of files, this is not currently reported by the
> PDFParser although the information is available via PDFBox.  I have attached a patch that
> adds the format version to the Metadata object.
> However, I am not familiar enough with the Tika source to know if an alternative metadata
> key should be used, or this new one added.
> Comments welcome.

This message was sent by Atlassian JIRA
