tika-dev mailing list archives

From "Johan van der Knijff (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
Date Wed, 05 Feb 2014 14:40:10 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892149#comment-13892149 ]

Johan van der Knijff commented on TIKA-1232:
--------------------------------------------

One thing to watch out for is that PDF has two places where the version can be defined: the
file header and, from PDF 1.4 onward, the Version entry of the catalog dictionary (referenced
from the trailer). The two can differ, in which case the catalog entry takes precedence if it
is later than the header version. See p. 39 of ISO 32000:

http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf
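
To make the precedence rule concrete, here is a minimal sketch in plain Java. It is not taken
from the attached patch and does not use the Tika or PDFBox APIs; the class and method names
(PdfVersionSketch, headerVersion, effectiveVersion) are hypothetical. It assumes versions are
"major.minor" strings and that the catalog Version value has already been obtained through a
PDF library. It follows the wording of ISO 32000-1, where the catalog entry only applies when
it is later than the header version.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class PdfVersionSketch {

    // Matches a header comment such as "%PDF-1.4".
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d+)\\.(\\d+)");

    /** Returns the header version (e.g. "1.4"), or null if no header is found in the bytes. */
    static String headerVersion(byte[] firstBytes) {
        String head = new String(firstBytes, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(head);
        return m.find() ? m.group(1) + "." + m.group(2) : null;
    }

    /**
     * Picks the catalog version only when it is present and later than the header version.
     * Falling back to the catalog value when the header is missing is an assumption of
     * this sketch, not something the specification prescribes.
     */
    static String effectiveVersion(String headerVersion, String catalogVersion) {
        if (catalogVersion == null) return headerVersion;
        if (headerVersion == null) return catalogVersion;
        return compare(catalogVersion, headerVersion) > 0 ? catalogVersion : headerVersion;
    }

    /** Compares "major.minor" strings numerically, so "1.10" sorts after "1.9". */
    static int compare(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        int major = Integer.compare(Integer.parseInt(x[0]), Integer.parseInt(y[0]));
        return major != 0 ? major : Integer.compare(Integer.parseInt(x[1]), Integer.parseInt(y[1]));
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.ISO_8859_1);
        String header = headerVersion(sample);
        System.out.println(effectiveVersion(header, "1.6")); // catalog is later than header
        System.out.println(effectiveVersion(header, null));  // no catalog entry
    }
}
```

Reporting both the header value and the resolved value would let consumers see when the two
disagree, which may be useful for the kind of file identification discussed in this issue.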

On top of that, PDF 1.7 also adds Extension Levels (p. 108); maybe those should be included
as well?


> Add PDF version to PDFParser output
> -----------------------------------
>
>                 Key: TIKA-1232
>                 URL: https://issues.apache.org/jira/browse/TIKA-1232
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.5
>         Environment: JDK6
>            Reporter: William Palmer
>            Assignee: Tim Allison
>            Priority: Minor
>         Attachments: pdfversion.patch
>
>
> I'd like to identify the PDF version of files, this is not currently reported by the
> PDFParser although the information is available via PDFBox.  I have attached a patch that
> adds the format version to the Metadata object.
> However, I am not familiar enough with the Tika source to know if an alternative metadata
> key should be used, or this new one added.
> Comments welcome.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
