tika-dev mailing list archives

From "Andrew Jackson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
Date Fri, 07 Feb 2014 10:29:21 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13894376#comment-13894376 ]

Andrew Jackson commented on TIKA-1232:


For (1), very happy for that code to go to PDFBox. I'm pretty sure PDFBox doesn't already
do anything along those lines, but I am not all that familiar with that codebase so it's worth
checking first.

As for (2), I've only tested on a fairly small number of PDFs because only the more recent
versions of the Adobe tools actually make use of extension levels, and even then, only when necessary.
I ran that code against a web archive corpus containing around 2 billion resources, including
many millions of PDFs, but because that dataset only ran up to 2010, I found a grand total
of eight PDFs that used Adobe Extension Level 3. It worked fine on those!

Finally, on the metadata property scheme, I feel the 'right place' is as a parameter on the
Content Type, but I accept that may confuse client code (e.g. people assuming type.equals("application/pdf")
should always work, even though that would already be no good for other types like HTML due to the
charset parameter).
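
To illustrate the concern about client code, here is a minimal sketch of comparing on the
base type rather than the full Content Type string. The class and method names
(ContentTypeSketch, baseType, parameters) and the "version" parameter name are hypothetical,
chosen only for this example; Tika's own MediaType class may already cover this, and it is
left out here only to keep the sketch dependency-free.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class ContentTypeSketch {

    // Returns the type/subtype part of a Content Type string, without parameters.
    static String baseType(String contentType) {
        int semi = contentType.indexOf(';');
        String base = semi < 0 ? contentType : contentType.substring(0, semi);
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Collects name=value parameters. Naive: it splits on ';' and so does not
    // handle a ';' inside a quoted parameter value.
    static Map<String, String> parameters(String contentType) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq > 0) {
                String name = parts[i].substring(0, eq).trim().toLowerCase(Locale.ROOT);
                String value = parts[i].substring(eq + 1).trim();
                // Drop surrounding double quotes, if any.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                params.put(name, value);
            }
        }
        return params;
    }

    public static void main(String[] args) {
        // Assumed form of a versioned type; the parameter name is an assumption.
        String type = "application/pdf; version=1.4";
        System.out.println(type.equals("application/pdf"));           // naive check
        System.out.println(baseType(type).equals("application/pdf")); // base-type check
        System.out.println(parameters(type).get("version"));
    }
}
```

The same base-type comparison also copes with "text/html; charset=UTF-8", which is why the
charset case is a fair precedent for the version parameter.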

Note that the parameter approach also allows you to do version detection in Tika's [custom-mimetypes.xml|https://github.com/openplanets/nanite/blob/master/nanite-core/src/main/resources/org/apache/tika/mime/custom-mimetypes.xml#L357],
which I find rather handy. Of course, you are also welcome to take any of those signatures
if they are of interest.
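
As a rough illustration of what signature-based version detection amounts to, here is a
sketch that reads the "%PDF-x.y" header and reports it as a Content Type parameter. This is
not the code from the patch or from the linked signature file; the class name
PdfVersionSketch, the detect method and the "version" parameter name are assumptions made
for the example. It only looks at the very start of the data and does not cover Adobe
extension levels, which are not recorded in the header.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PdfVersionSketch {

    // A PDF file normally starts with a header such as "%PDF-1.4".
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d\\.\\d)");

    // Returns a versioned type string, or null if no header is found at offset 0.
    static String detect(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so binary data is safe to scan.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(text);
        if (m.lookingAt()) {
            return "application/pdf; version=" + m.group(1);
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n1 0 obj\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect(sample));
        System.out.println(detect("<html>".getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```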

> Add PDF version to PDFParser output
> -----------------------------------
>                 Key: TIKA-1232
>                 URL: https://issues.apache.org/jira/browse/TIKA-1232
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.5
>         Environment: JDK6
>            Reporter: William Palmer
>            Assignee: Tim Allison
>            Priority: Minor
>         Attachments: pdfversion.patch
> I'd like to identify the PDF version of files. This is not currently reported by the
> PDFParser, although the information is available via PDFBox. I have attached a patch that
> adds the format version to the Metadata object.
> However, I am not familiar enough with the Tika source to know whether an alternative metadata
> key should be used, or this new one added.
> Comments welcome.

This message was sent by Atlassian JIRA
