tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title
Date Wed, 15 Jul 2015 11:06:04 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627897#comment-14627897
] 

Tim Allison commented on TIKA-1678:
-----------------------------------

@Andrew Jackson, good to hear from you!  Yes, the current code tries to pull content out of
the XMP for some Dublin Core items.  If that info is not available, then it backs off to the
"native" metadata.

So, I'm not sure how to fix this...  Any recommendations?

I'm slightly puzzled that you are getting two author entries...I'll look into this.
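
For discussion, here is a rough sketch of the XMP-first lookup with back-off described above, plus one possible heuristic for the malformed case in this issue. It is written in Python for brevity and is not Tika's actual code: the names pick_value and looks_undecoded are hypothetical, and the check for a literal "\376\377" prefix is only an assumption about how such broken XMP values could be recognised.

```python
from typing import Optional

# Literal text of an octal-escaped UTF-16BE byte order mark, as it appears
# in the malformed XMP packet quoted in this issue (assumed marker).
ESCAPED_BOM = "\\376\\377"


def looks_undecoded(value: str) -> bool:
    """Heuristic: the value still starts with PDF octal escapes for a BOM."""
    return value.startswith(ESCAPED_BOM)


def pick_value(xmp_value: Optional[str], native_value: Optional[str]) -> Optional[str]:
    """Prefer the XMP value; back off to the native Info value otherwise."""
    if xmp_value and not looks_undecoded(xmp_value):
        return xmp_value
    if native_value:
        return native_value
    # Nothing better is available, so return whatever the XMP held (may be None).
    return xmp_value
```

With this heuristic, an XMP title that still begins with the escaped BOM would be skipped in favour of the decoded native title, while well-formed XMP values would keep their priority.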

> PDF metadata extraction fails to spot UTF-16 encoded title
> ----------------------------------------------------------
>
>                 Key: TIKA-1678
>                 URL: https://issues.apache.org/jira/browse/TIKA-1678
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata
>    Affects Versions: 1.9
>            Reporter: Andrew Jackson
>            Priority: Minor
>
> When extracting metadata from PDFs, we see some odd behaviour for a minority of the documents.
The PDF metadata can be encoded as UTF-16 octets, but is not always being decoded as such.
> A specific example is here: http://mqug.org.uk/downloads/201207/201207%20-%20TEC02%20-%20Introduction%20to%20Worklight.pdf
> Which contains this (literal file content):
> {noformat}
> 443 0 obj
> <</Type/Metadata
> /Subtype/XML/Length 1978>>stream
> <?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
> <?adobe-xap-filters esc="CRLF"?>
> <x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 1.6'>
> <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' xmlns:iX='http://ns.adobe.com/iX/1.0/'>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' xmlns:pdf='http://ns.adobe.com/pdf/1.3/'
pdf:Producer='\376\377\000B\000u\000l\000l\000z\000i\000p\000 \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000
\000/\000 \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
\000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 \000E\000d\000i\000t\000i\000o\000n'/>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' xmlns:xmp='http://ns.adobe.com/xap/1.0/'><xmp:ModifyDate>2012-07-18T15:38:01+01:00</xmp:ModifyDate>
> <xmp:CreateDate>2012-07-18T15:38:01+01:00</xmp:CreateDate>
> <xmp:CreatorTool>UnknownApplication</xmp:CreatorTool></rdf:Description>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/'
xapMM:DocumentID='ac9f232e-d341-11e1-0000-ba905bfc4694'/>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' xmlns:dc='http://purl.org/dc/elements/1.1/'
dc:format='application/pdf'><dc:title><rdf:Alt><rdf:li xml:lang='x-default'>\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000
\000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000
\000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x</rdf:li></rdf:Alt></dc:title><dc:creator><rdf:Seq><rdf:li>\376\377\000T\000e\000t\000t\000i</rdf:li></rdf:Seq></dc:creator></rdf:Description>
> </rdf:RDF>
> </x:xmpmeta>
> <?xpacket end='w'?>
> endstream
> endobj
> 2 0 obj
> <</Producer(\376\377\000B\000u\000l\000l\000z\000i\000p\000 \000P\000D\000F\000
\000P\000r\000i\000n\000t\000e\000r\000 \000/\000 \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
\000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 \000E\000d\000i\000t\000i\000o\000n)
> /CreationDate(D:20120718153801+01'00')
> /ModDate(D:20120718153801+01'00')
> /Title(\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000
\000-\000 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000
\000W\000o\000r\000k\000l\000i\000g\000h\000t\000 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x)
> /Author(\376\377\000T\000e\000t\000t\000i)>>endobj
> {noformat} 
> Presumably, embedding these UTF-16 octet sequences in the XMP RDF is an error, but the
ones encoded in the actual PDF metadata fields should be extracted accurately.
> When extracted, we get:
> {noformat}
> ...
> dc:title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000
\000-\000 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000
\000W\000o\000r\000k\000l\000i\000g\000h\000t\000 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000
\000-\000 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000
\000W\000o\000r\000k\000l\000i\000g\000h\000t\000 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> meta:author: \376\377\000T\000e\000t\000t\000i
> meta:author: Tetti
> ...
> {noformat}
> So, the author appears to be decoded correctly once, but the title is not. Is the XML
dc:title being used to override the PDF title field? Or is one of the title fields being decoded
incorrectly?
> (I accept that although this is a real PDF document from the web, it is also a malformed
one, so maybe there is not much to be done here.)
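
To make the encoding in the quoted strings concrete, the following is a minimal sketch of how such a value could be decoded: first undo the PDF literal-string escapes (octal sequences such as \376 and single-character escapes such as \( ), then treat a leading FE FF byte pair as a UTF-16BE byte order mark. The function names are hypothetical, the input is assumed to be Latin-1 text, line continuations are not handled, and Latin-1 is used only as a rough stand-in for PDFDocEncoding, which differs for some code points.

```python
import re

# Single-character escapes defined for PDF literal strings.
_ESCAPES = {"n": 0x0A, "r": 0x0D, "t": 0x09, "b": 0x08, "f": 0x0C,
            "(": 0x28, ")": 0x29, "\\": 0x5C}

# One token is: a backslash plus 1-3 octal digits, a backslash plus any
# other character, or a single plain character.
_TOKEN = re.compile(r"\\([0-7]{1,3})|\\(.)|(.)", re.DOTALL)


def unescape_pdf_literal(text: str) -> bytes:
    """Turn the escaped body of a PDF literal string into raw bytes."""
    out = bytearray()
    for octal, escaped, plain in _TOKEN.findall(text):
        if octal:
            out.append(int(octal, 8) & 0xFF)
        elif escaped:
            if escaped in _ESCAPES:
                out.append(_ESCAPES[escaped])
            else:
                # Unknown escape: drop the backslash, keep the character.
                out.extend(escaped.encode("latin-1"))
        else:
            out.extend(plain.encode("latin-1"))
    return bytes(out)


def decode_pdf_text(raw: bytes) -> str:
    """Decode a PDF text string, honouring a UTF-16BE byte order mark."""
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    # Approximation only: real PDFDocEncoding is not identical to Latin-1.
    return raw.decode("latin-1")


author = decode_pdf_text(unescape_pdf_literal(r"\376\377\000T\000e\000t\000t\000i"))
print(author)  # prints "Tetti"
```

Applied to the /Author value from the example, this yields the same "Tetti" that appears in the second meta:author entry of the extracted output.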



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
