tika-dev mailing list archives

From "Tim Reynolds (JIRA)" <j...@apache.org>
Subject [jira] Created: (TIKA-363) PDF Content Type seen as application/rdf+xml not appliction/pdf
Date Wed, 13 Jan 2010 21:23:54 GMT
PDF Content Type seen as application/rdf+xml not appliction/pdf

                 Key: TIKA-363
                 URL: https://issues.apache.org/jira/browse/TIKA-363
             Project: Tika
          Issue Type: Bug
    Affects Versions: 0.5
         Environment: JDK 1.5, Windows XP, Adobe Acrobat Pro 8, Luke 0.9.9, tika-app-0.5.jar,
Eclipse 4.2, Lucene In Action, Second Edition source code (TikaIndexer.java)
            Reporter: Tim Reynolds
            Priority: Minor

I am using TikaIndexer.java from the source code of Lucene In Action, Second Edition,
to index PDF files. Most PDF files work fine, as verified with Luke (0.9.9), but some files
show a content type of application/rdf+xml instead of application/pdf, and thus show no
metadata in Luke.

The PDF files that show application/rdf+xml had been opened in Adobe Acrobat Pro 8.
Highlights, bookmarks, and notes were added to the files. This was done several times,
with many saves. Acrobat can read these files without a problem.

The original PDFs show application/pdf; the modified files show application/rdf+xml.

If I open the PDF files in my editor (Vim), I do see some CR+LF strangeness.
Both the good and the "bad" files have

0000000: 2550 4446 2d31 2e36 0d25 e2e3 cfd3 0d0a %PDF-1.6.%......

for the first line, but the "bad" file doesn't have another 0d0a (CR+LF) until

0001210: 6574 2065 6e64 3d22 7722 3f3e 0d0a 656e et end="w"?>..en

Up to that point I do see some lone 0d (CR) bytes, but no CR+LF pairs. It may be that
something is getting confused because it sees this very long line. I don't know why
the file stops using CR+LF. I assume this confusion then leads Tika to guess
that this is an rdf+xml file.
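
The byte-level check described above can be reproduced without a hex editor. The following is a minimal sketch, not part of Tika or of the original report: the class name PdfByteCheck and both helper methods are hypothetical, and it assumes Java 7 or later (newer than the JDK 1.5 listed in the environment). It checks for the %PDF- marker and lists the offset of each CR+LF pair, so that a good file and a "bad" file can be compared side by side.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a Tika class.
final class PdfByteCheck {

    /** Returns true if the data begins with the PDF marker "%PDF-". */
    public static boolean startsWithPdfHeader(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the offset of every CR+LF (0x0d 0x0a) pair; lone CRs are skipped. */
    public static List<Integer> crlfOffsets(byte[] data) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i + 1 < data.length; i++) {
            if (data[i] == 0x0d && data[i + 1] == 0x0a) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java PdfByteCheck <file.pdf>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("PDF header present: " + startsWithPdfHeader(data));
        List<Integer> offsets = crlfOffsets(data);
        System.out.println("CR+LF pairs found: " + offsets.size());
        // Print only the first few offsets; the gap between the first
        // and second pair is what differs in the "bad" files.
        for (int i = 0; i < Math.min(5, offsets.size()); i++) {
            System.out.println("  offset 0x" + Integer.toHexString(offsets.get(i)));
        }
    }
}
```

Applied to the 16 bytes of the first line shown in the dump, crlfOffsets reports a single pair at offset 14 (0xe); the lone CR at offset 8 is not counted.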

I found the following Tika bug: "Mime type application/rdf+xml not correctly detected"
[#TIKA-309]. It is marked as fixed in 0.5, which is the version I am using.

