tika-dev mailing list archives

From "Andrew Jackson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title
Date Tue, 21 Jul 2015 12:29:05 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635032#comment-14635032 ]

Andrew Jackson commented on TIKA-1678:
--------------------------------------

Sorry for the delay. Here are the results:

* title starts with \376\377: 252,903 out of 21,204,500 PDFs.
* title starts with \377: 0 out of 21,204,500 PDFs.
* title starts with \357: 0 out of 21,204,500 PDFs.
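
For reference, the three prefixes counted above are the leading bytes of the common Unicode byte order marks: \376\377 is FE FF (UTF-16BE), \377 is the first byte of FF FE (UTF-16LE), and \357 is the first byte of EF BB BF (UTF-8). A minimal sketch of that kind of prefix check follows, assuming the title is available as raw bytes. The function name `classify_title_prefix` is hypothetical, and this is not the script that produced the counts above.

```python
def classify_title_prefix(raw: bytes) -> str:
    """Label a raw PDF title by the byte order mark it starts with, if any."""
    if raw.startswith(b"\xfe\xff"):        # octal \376\377
        return "utf-16-be"
    if raw.startswith(b"\xff\xfe"):        # octal \377\376
        return "utf-16-le"
    if raw.startswith(b"\xef\xbb\xbf"):    # octal \357\273\277
        return "utf-8"
    return "no-bom"
```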

There are a tiny handful of mixed-up oddities that look like this:

{code}
{
        "url":"http://www.praksis.gr/assets/files/h_PRAKSIS_sto_big_march.pdf",
        "wayback_date":"20141205021311",
        "title":"(Microsoft Word - \\323\\365\\354\\354\\345\\364\\357\\367\\336 \\364\\347\\362 PRAKSIS \\363\\364\\357 BIG MARCH _1_)",
        "generator":["PScript5.dll Version 5.2.2",
          "GPL Ghostscript 8.15"]},
{code}

(see the original here: http://web.archive.org/web/20150721122710/http://www.praksis.gr/assets/files/h_PRAKSIS_sto_big_march.pdf)

But these are such minor exceptions that I don't think they're worth pursuing.
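
The escaped bytes in the title above appear to decode to readable Greek when read as ISO-8859-7, which suggests a legacy 8-bit encoding rather than a damaged UTF-16 string. A rough sketch of that check follows, assuming the JSON has already been parsed so that each escape is a single backslash followed by three octal digits. `unescape_octal` is a hypothetical helper, and ISO-8859-7 is an assumption about the encoding, not something the PDF declares.

```python
import re

def unescape_octal(text: str) -> bytes:
    """Turn backslash + three octal digits into one byte; keep other characters as Latin-1."""
    return re.sub(
        rb"\\([0-3][0-7]{2})",
        lambda m: bytes([int(m.group(1), 8)]),
        text.encode("latin-1"),
    )

title = (r"(Microsoft Word - \323\365\354\354\345\364\357\367\336 "
         r"\364\347\362 PRAKSIS \363\364\357 BIG MARCH _1_)")

# Assumption: the 8-bit bytes are ISO-8859-7 (Greek).
decoded = unescape_octal(title).decode("iso8859_7")
print(decoded)
```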

> PDF metadata extraction fails to spot UTF-16 encoded title
> ----------------------------------------------------------
>
>                 Key: TIKA-1678
>                 URL: https://issues.apache.org/jira/browse/TIKA-1678
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata
>    Affects Versions: 1.9
>            Reporter: Andrew Jackson
>            Priority: Minor
>
> When extracting metadata from PDFs, we see some odd behaviour for a minority of the documents.
The PDF metadata can be encoded as UTF-16 octets, but is not always being decoded as such.
> A specific example is here: http://mqug.org.uk/downloads/201207/201207%20-%20TEC02%20-%20Introduction%20to%20Worklight.pdf
> Which contains this (literal file content):
> {noformat}
> 443 0 obj
> <</Type/Metadata
> /Subtype/XML/Length 1978>>stream
> <?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
> <?adobe-xap-filters esc="CRLF"?>
> <x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 1.6'>
> <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' xmlns:iX='http://ns.adobe.com/iX/1.0/'>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' xmlns:pdf='http://ns.adobe.com/pdf/1.3/'
pdf:Producer='\376\377\000B\000u\000l\000l\000z\000i\000p\000 \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000
\000/\000 \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
\000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 \000E\000d\000i\000t\000i\000o\000n'/>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' xmlns:xmp='http://ns.adobe.com/xap/1.0/'><xmp:ModifyDate>2012-07-18T15:38:01+01:00</xmp:ModifyDate>
> <xmp:CreateDate>2012-07-18T15:38:01+01:00</xmp:CreateDate>
> <xmp:CreatorTool>UnknownApplication</xmp:CreatorTool></rdf:Description>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/'
xapMM:DocumentID='ac9f232e-d341-11e1-0000-ba905bfc4694'/>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' xmlns:dc='http://purl.org/dc/elements/1.1/'
dc:format='application/pdf'><dc:title><rdf:Alt><rdf:li xml:lang='x-default'>\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000
\000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000
\000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x</rdf:li></rdf:Alt></dc:title><dc:creator><rdf:Seq><rdf:li>\376\377\000T\000e\000t\000t\000i</rdf:li></rdf:Seq></dc:creator></rdf:Description>
> </rdf:RDF>
> </x:xmpmeta>
> <?xpacket end='w'?>
> endstream
> endobj
> 2 0 obj
> <</Producer(\376\377\000B\000u\000l\000l\000z\000i\000p\000 \000P\000D\000F\000
\000P\000r\000i\000n\000t\000e\000r\000 \000/\000 \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
\000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 \000E\000d\000i\000t\000i\000o\000n)
> /CreationDate(D:20120718153801+01'00')
> /ModDate(D:20120718153801+01'00')
> /Title(\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000
\000-\000 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000
\000W\000o\000r\000k\000l\000i\000g\000h\000t\000 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x)
> /Author(\376\377\000T\000e\000t\000t\000i)>>endobj
> {noformat} 
> Presumably, embedding these UTF-16 octet sequences in the XMP RDF is an error, but the
ones encoded in the actual PDF metadata fields should be extracted accurately.
> When extracted, we get:
> {noformat}
> ...
> dc:title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000
\000-\000 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000
\000W\000o\000r\000k\000l\000i\000g\000h\000t\000 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000
\000-\000 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000
\000W\000o\000r\000k\000l\000i\000g\000h\000t\000 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> meta:author: \376\377\000T\000e\000t\000t\000i
> meta:author: Tetti
> ...
> {noformat}
> So, the author appears to be decoded correctly once, but the title is not. Is the XML
dc:title being used to override the PDF title field? Or is one of the title fields being decoded
incorrectly?
> (I accept that although this is a real PDF document from the web, it is also a malformed
one, so maybe there is not much to be done here.)
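
The decoding expected for the Info dictionary fields quoted above can be sketched as follows. A PDF text string that begins with the bytes FE FF (\376\377) is UTF-16BE; otherwise it uses PDFDocEncoding. This is a minimal illustration, not the actual Tika or PDFBox code. `decode_pdf_text_string` is a hypothetical name, Latin-1 stands in for PDFDocEncoding as a simplifying assumption, and line-continuation escapes are not handled.

```python
import re

# A backslash followed by either three octal digits or any single byte.
_ESCAPE = re.compile(rb"\\([0-3][0-7]{2}|.)", re.DOTALL)
_NAMED = {b"n": b"\n", b"r": b"\r", b"t": b"\t", b"b": b"\b", b"f": b"\f"}

def _unescape(match):
    token = match.group(1)
    if len(token) == 3:                  # octal escape such as \376
        return bytes([int(token, 8)])
    return _NAMED.get(token, token)      # \n etc., or \( \) \\ taken literally

def decode_pdf_text_string(literal: bytes) -> str:
    """Decode the body of a PDF literal string (without the outer parentheses)."""
    raw = _ESCAPE.sub(_unescape, literal)
    if raw.startswith(b"\xfe\xff"):      # UTF-16BE byte order mark
        return raw[2:].decode("utf-16-be")
    # Assumption: Latin-1 as an approximation of PDFDocEncoding.
    return raw.decode("latin-1")

author = decode_pdf_text_string(rb"\376\377\000T\000e\000t\000t\000i")
print(author)
```

Applied to the /Author value from the example, this gives the same "Tetti" that appears in the second meta:author line of the extracted output.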



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
