tika-dev mailing list archives

From "Andrew Jackson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title
Date Wed, 15 Jul 2015 11:33:04 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627913#comment-14627913
] 

Andrew Jackson commented on TIKA-1678:
--------------------------------------

I'm seeing this in about 220,000 out of 21,204,351 PDFs crawled from 2013 onwards (roughly
1%), so it's a small percentage but a lot of documents. I thought it might be down to one or
two implementations, but I'm seeing a fairly broad range of software IDs:

{noformat}
     "generator": [
        "Bullzip PDF Printer / www.bullzip.com / Freeware Edition (not registered)",
        30,
        "GPL Ghostscript 8.54 PDF Writer",
        30,
        "PDFCreator Version 0.9.5",
        30,
        "PDFCreator Version 1.1.0",
        30,
        "Neevia Document Converter 5.2",
        29,
        "pdfcreator Version 0.9.9",
        29,
        "AFPL Ghostscript 8.54 PDF Writer",
        28,
        "PDFCreator Version 1.0.0",
        28,
        "GPL Ghostscript 8.64 ps2pdf.com",
        27,
...
{noformat}

That octal-escaped UTF-16BE BOM (\376\377) is pretty specific, so I think writing a handler
to catch it is unlikely to cause problems elsewhere. But I'm not really sure how to fix this either.
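
To make the handler idea concrete, here is a minimal sketch of what such a catch might look like. It is not Tika or PDFBox code: the class name OctalUtf16Decoder and its decode method are hypothetical, and it assumes the value reaches the handler as a Java String that still contains the literal backslash escapes, as in the extracted output quoted below. It only acts when the string starts with the escaped BOM, and treats any other backslash escape (such as \( and \)) as the escaped character itself.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika or PDFBox.
final class OctalUtf16Decoder {

    // The UTF-16BE byte order mark (0xFE 0xFF) written as literal octal escapes.
    private static final String ESCAPED_BOM = "\\376\\377";

    // Returns the decoded text, or the input unchanged if it does not start
    // with the escaped BOM or does not unescape to whole UTF-16 code units.
    static String decode(String raw) {
        if (raw == null || !raw.startsWith(ESCAPED_BOM)) {
            return raw;
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c == '\\' && i + 1 < raw.length()) {
                // Read up to three octal digits after the backslash.
                int j = i + 1;
                int value = 0;
                while (j < raw.length() && j < i + 4
                        && raw.charAt(j) >= '0' && raw.charAt(j) <= '7') {
                    value = value * 8 + (raw.charAt(j) - '0');
                    j++;
                }
                if (j > i + 1) {
                    bytes.write(value & 0xFF);
                    i = j;
                } else {
                    // Simplification: any other escape, e.g. \( or \), is
                    // taken as the escaped character itself.
                    bytes.write(raw.charAt(i + 1) & 0xFF);
                    i += 2;
                }
            } else {
                // An unescaped character stands for a single byte.
                bytes.write(c & 0xFF);
                i++;
            }
        }
        byte[] all = bytes.toByteArray();
        if ((all.length - 2) % 2 != 0) {
            return raw; // odd number of payload bytes: leave untouched
        }
        // Skip the two BOM bytes and decode the rest as big-endian UTF-16.
        return new String(all, 2, all.length - 2, StandardCharsets.UTF_16BE);
    }
}
```

With this sketch, passing the escaped author value from the example PDF (\376\377\000T\000e\000t\000t\000i) to OctalUtf16Decoder.decode gives "Tetti", and a string without the escaped BOM is returned as-is, which is why I'd expect it to be safe to apply to every metadata value.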

In case it helps, here are some more (randomly chosen) URLs that seem to display the same
issue (if they've not disappeared from the live web already!):

{noformat}
http://www.girlsb.org.uk/media/060a6d49/After_McDonaldization_Chapter_1.pdf
http://www.uniswales.ac.uk/wp/media/2011-March-The-Impact-of-International-and-EU-Students-in-Wales.pdf
http://www.youthworkwales.org.uk/creo_files/upload/files/gd_in_yw_conceptual_model_2009_1_.pdf
http://www.transitionchepstow.org.uk/wp-content/uploads/2014/09/Living-with-climate-change-poster.pdf
http://www.staustelltowncouncil.com/St-Austell-Town-Council/UserFiles/Files/Committees/Community/Agendas/2010/community%20agenda%206%20Sept%2010.pdf
http://community.stroud.gov.uk/_documents/79_SmartWater-NW-Kit-Leaflet3.pdf
http://www.recycleformerthyr.co.uk/media/9365/dowlais%20juniors%20school%20photos.pdf
http://www.visitmerthyr.co.uk/media/24663/volunteering_poster.psd_welsh.pdf
http://merthyrcynon.foodbank.org.uk/resources/documents/Get%20Involved/Gift-Aid-Form/Gift-Aid-form.pdf
http://www.basquechildren.org/-/docs/clarion
http://www.biicl.org/files/3776_5_-_richard_happ.pdf
http://www.artscouncil-ni.org/images/uploads/publications-documents/ArtsandHealth.pdf
http://www.lawsoc-ni.org/download/fs/doc/LEXCEL_APPLICATION_&_STATUS_ENQUIRY_FORMS_2010%5b1%5d/pdf/
http://www.templechurch.com/wp-content/uploads/2012/08/Olympic-poster.pdf
http://www.llennatur.com/files/u1/Cylchgrawn32.pdf
https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/315413/cau101.pdf
https://www.networks.nhs.uk/nhs-networks/common-assessment-framework-for-adults-learning/archived-material-from-caf-network-website-pre-april-2012/barnsley-ig-toolkit-aug-2010/FINAL_BMBC_ASSD_NHS_Numb
http://www.ccfgb.co.uk/images/Visit.pdf
http://www.wihb.scot.nhs.uk/hairt-reports/policies/key-infection-prevention-policies?task=document.viewdoc&id=22
http://www.ed.ac.uk/polopoly_fs/1.94783!/fileManager/martha%20hamilton%20trust%20app%20form12.pdf
http://stophs2.org/wp-content/uploads/2010/11/EHS_booklet.pdf
http://www.brookes.ac.uk/Documents/Regulations/Current/Core/A1/Technolgy--Design---Environment-Prizes/
http://www.nus.org.uk/PageFiles/4011/ACTSA_Events_December.pdf
https://www.ids.ac.uk/files/dmfile/GCSTDemocracyandSecurity34_WP8.pdf
http://www.theigc.org/wp-content/uploads/2015/02/Chaudhry-Woodruff-2013-Working-Paper.pdf
{noformat}

> PDF metadata extraction fails to spot UTF-16 encoded title
> ----------------------------------------------------------
>
>                 Key: TIKA-1678
>                 URL: https://issues.apache.org/jira/browse/TIKA-1678
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata
>    Affects Versions: 1.9
>            Reporter: Andrew Jackson
>            Priority: Minor
>
> When extracting metadata from PDFs, we see some odd behaviour for a minority of the documents.
The PDF metadata can be encoded as UTF-16 octets, but is not always being decoded as such.
> A specific example is here: http://mqug.org.uk/downloads/201207/201207%20-%20TEC02%20-%20Introduction%20to%20Worklight.pdf
> Which contains this (literal file content):
> {noformat}
> 443 0 obj
> <</Type/Metadata
> /Subtype/XML/Length 1978>>stream
> <?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
> <?adobe-xap-filters esc="CRLF"?>
> <x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 1.6'>
> <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' xmlns:iX='http://ns.adobe.com/iX/1.0/'>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' xmlns:pdf='http://ns.adobe.com/pdf/1.3/'
pdf:Producer='\376\377\000B\000u\000l\000l\000z\000i\000p\000 \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000
\000/\000 \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
\000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 \000E\000d\000i\000t\000i\000o\000n'/>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' xmlns:xmp='http://ns.adobe.com/xap/1.0/'><xmp:ModifyDate>2012-07-18T15:38:01+01:00</xmp:ModifyDate>
> <xmp:CreateDate>2012-07-18T15:38:01+01:00</xmp:CreateDate>
> <xmp:CreatorTool>UnknownApplication</xmp:CreatorTool></rdf:Description>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/'
xapMM:DocumentID='ac9f232e-d341-11e1-0000-ba905bfc4694'/>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' xmlns:dc='http://purl.org/dc/elements/1.1/'
dc:format='application/pdf'><dc:title><rdf:Alt><rdf:li xml:lang='x-default'>\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000
\000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000
\000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x</rdf:li></rdf:Alt></dc:title><dc:creator><rdf:Seq><rdf:li>\376\377\000T\000e\000t\000t\000i</rdf:li></rdf:Seq></dc:creator></rdf:Description>
> </rdf:RDF>
> </x:xmpmeta>
> <?xpacket end='w'?>
> endstream
> endobj
> 2 0 obj
> <</Producer(\376\377\000B\000u\000l\000l\000z\000i\000p\000 \000P\000D\000F\000
\000P\000r\000i\000n\000t\000e\000r\000 \000/\000 \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
\000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 \000E\000d\000i\000t\000i\000o\000n)
> /CreationDate(D:20120718153801+01'00')
> /ModDate(D:20120718153801+01'00')
> /Title(\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000
\000-\000 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000
\000W\000o\000r\000k\000l\000i\000g\000h\000t\000 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x)
> /Author(\376\377\000T\000e\000t\000t\000i)>>endobj
> {noformat} 
> Presumably, embedding these UTF-16 octet sequences in the XMP RDF is an error, but the
ones encoded in the actual PDF metadata fields should be extracted accurately.
> When extracted, we get:
> {noformat}
> ...
> dc:title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000
\000-\000 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000
\000W\000o\000r\000k\000l\000i\000g\000h\000t\000 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000
\000-\000 \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000
\000W\000o\000r\000k\000l\000i\000g\000h\000t\000 \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> meta:author: \376\377\000T\000e\000t\000t\000i
> meta:author: Tetti
> ...
> {noformat}
> So, the author appears to be decoded correctly once, but the title is not. Is the XML
dc:title being used to override the PDF title field? Or is one of the title fields being decoded
incorrectly?
> (I accept that although this is a real PDF document from the web, it is also a malformed
one, so maybe there is not much to be done here.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
