tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: [DISCUSS] options for XMP parsing?
Date Tue, 15 Mar 2016 12:29:29 GMT
Thank you!  Will take a look at the SO link, and I'll see if I can dig up any of these in our
regression testing corpus.

-----Original Message-----
From: Ray Gauss [mailto:ray.gauss@alfresco.com] 
Sent: Monday, March 14, 2016 1:06 PM
To: dev@tika.apache.org
Subject: Re: [DISCUSS] options for XMP parsing?

Hi Tim,

Consolidated handling of XMP would be great, I'm glad you're taking a look at it and I'll try
to help out where I can.

> You've been happy with it at Alfresco? 

It's been a while since I looked at it but I don't recall any difficulties.

> I'd be interested to hear more about what happens with InDesign files.

It stores things in 'pages' [1].

Regards,

Ray


[1] http://stackoverflow.com/a/22661992


> On Mar 10, 2016, at 9:38 AM, Allison, Timothy B. <tallison@mitre.org> wrote:
> 
> Hi Ray,
>  Got it.  Thank you.
> 
> That'd be great.  In follow-up discussion with the PDFBox devs, they mentioned
> that it is not a design feature/restriction of XMPBox that it doesn't handle
> non-PDF/A files...only a matter of patching and building out their current
> code base.  The downside is there's quite a bit to do; the upside is that it
> is a living code base.
> 
> I'll experiment with Adobe's xmp-core.  If you have any pointers/examples,
> let me know...I'll be starting with:
> https://indisnip.wordpress.com/2010/08/17/extract-metadata-with-adobe-xmp-part-2/
> You've been happy with it at Alfresco?
> 
> No matter which package we use, it would be nice to build out uniform
> extraction of XMP for all image and PDF files for the common elements -- with
> special handling by file type if necessary.  As you mentioned, it would also
> be great to add or modify our XMPScanner to extract all XMP packets from a
> file...I've started dabbling with this here:
> https://github.com/tballison/tika/tree/xmp_scanner
> I'd be interested to hear more about what happens with InDesign files.  In our
> own test set, we have a PDF file with two packets containing conflicting
> authorship info IIRC! :)  It would be nice to expose both the canonical XMP
> info (with proper processing of "later-xmp-overrides-earlier") as well as all
> of the info that can be scraped from the XMP (packet1: authorXYZ packet2:
> authorQRS)...two different use cases.
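
[Editorial sketch] The two use cases described above (one canonical view where a later packet overrides an earlier one, plus a full view of every value scraped from every packet) can be illustrated roughly as follows. This is a minimal sketch using only the JDK; it is not Tika's XMPPacketScanner and not the code on the xmp_scanner branch. The class and method names are hypothetical, and the scan assumes packets in an ASCII-compatible encoding (UTF-16 packets would be missed).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names are illustrative and not part of Tika.
final class XmpPacketSketch {

    // Returns every span from "<?xpacket begin" through the "?>" that closes
    // the following "<?xpacket end" instruction, in file order.
    static List<String> findPackets(byte[] data) {
        // ISO-8859-1 maps each byte to exactly one char, so binary input is safe.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        List<String> packets = new ArrayList<>();
        int from = 0;
        while (true) {
            int begin = text.indexOf("<?xpacket begin", from);
            if (begin < 0) break;
            int end = text.indexOf("<?xpacket end", begin);
            if (end < 0) break;
            int close = text.indexOf("?>", end);
            if (close < 0) break;
            packets.add(text.substring(begin, close + 2));
            from = close + 2;
        }
        return packets;
    }

    // Canonical view: a value from a later packet replaces an earlier one.
    static Map<String, String> canonical(List<Map<String, String>> perPacket) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> props : perPacket) {
            merged.putAll(props);
        }
        return merged;
    }

    // Full view: every value seen for a property, kept in packet order.
    static Map<String, List<String>> allValues(List<Map<String, String>> perPacket) {
        Map<String, List<String>> all = new LinkedHashMap<>();
        for (Map<String, String> props : perPacket) {
            for (Map.Entry<String, String> e : props.entrySet()) {
                all.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return all;
    }

    public static void main(String[] args) {
        Map<String, String> packet1 = new LinkedHashMap<>();
        packet1.put("dc:creator", "authorXYZ");
        Map<String, String> packet2 = new LinkedHashMap<>();
        packet2.put("dc:creator", "authorQRS");
        List<Map<String, String>> perPacket = Arrays.asList(packet1, packet2);
        System.out.println("canonical: " + canonical(perPacket));
        System.out.println("all:       " + allValues(perPacket));
    }
}
```

Keeping the per-packet maps separate until the last step is what lets both views be produced from a single scan.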
> 
> Thank you, again.
> 
>             Cheers,
> 
>                   Tim
> 
> 
> 
> 
> -----Original Message-----
> From: Ray Gauss [mailto:ray.gauss@alfresco.com]
> Sent: Tuesday, March 08, 2016 2:34 PM
> To: dev@tika.apache.org
> Subject: Re: [DISCUSS] options for XMP parsing?
> 
> To clarify... the 'we' in my third sentence was referring to Alfresco, not Tika.
> 
> I'm not sure how much of that code would be useful but I may be able to
> contribute some of it.
> 
> Regards,
> 
> Ray
> 
> 
>> On Mar 8, 2016, at 2:07 PM, Allison, Timothy B. <tallison@mitre.org> wrote:
>> 
>> Thank you.  Will take a look.
>> 
>> -----Original Message-----
>> From: Ray Gauss [mailto:ray.gauss@alfresco.com]
>> Sent: Tuesday, March 08, 2016 1:55 PM
>> To: dev@tika.apache.org
>> Subject: Re: [DISCUSS] options for XMP parsing?
>> 
>> Hi Tim,
>> 
>> We're already using Adobe's xmpcore in tika-xmp, which works fine for parsing
>> XMP (though it has not seen updates in a while), but getting the XMP packets
>> out of the files is trickier.
>> 
>> We have XMPPacketScanner, which works for many cases, but not all.  InDesign
>> files, for example, do some strange things.
>> 
>> In the past we've used different packet scanners depending on the file type
>> (including the Exiftool command line) to get the XMP out, then used xmpcore to
>> parse it into simple flattened properties.
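
[Editorial sketch] To make "simple flattened properties" concrete, here is a rough illustration that flattens an already-extracted XMP packet into a map from property name to values. It deliberately does not use xmpcore (whose API is not shown in this thread); it swaps in the JDK's DOM parser as a stand-in, and the class and method names are hypothetical. It accepts any rdf:li items regardless of whether they sit in a Seq, Bag, or Alt, and it ignores nested structures and properties written as attributes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical stand-in for a real XMP library; illustration only.
final class XmpFlattenSketch {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Maps "namespaceURI + localName" to the list of values found for it.
    static Map<String, List<String>> flatten(String xmp) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Reject DOCTYPE declarations as a basic guard on untrusted input.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));

            Map<String, List<String>> out = new LinkedHashMap<>();
            NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
            for (int i = 0; i < descriptions.getLength(); i++) {
                Node description = descriptions.item(i);
                for (Node n = description.getFirstChild(); n != null; n = n.getNextSibling()) {
                    if (!(n instanceof Element)) continue; // skip whitespace text nodes
                    Element prop = (Element) n;
                    String key = prop.getNamespaceURI() + prop.getLocalName();
                    List<String> values = out.computeIfAbsent(key, k -> new ArrayList<>());
                    // Lenient: take the items whether the container is Seq, Bag, or Alt.
                    NodeList items = prop.getElementsByTagNameNS(RDF_NS, "li");
                    if (items.getLength() == 0) {
                        values.add(prop.getTextContent().trim());
                    } else {
                        for (int j = 0; j < items.getLength(); j++) {
                            values.add(items.item(j).getTextContent().trim());
                        }
                    }
                }
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Packet is not parseable as XML", e);
        }
    }

    public static void main(String[] args) {
        String xmp = "<x:xmpmeta xmlns:x='adobe:ns:meta/'>"
                + "<rdf:RDF xmlns:rdf='" + RDF_NS + "'>"
                + "<rdf:Description xmlns:dc='http://purl.org/dc/elements/1.1/'>"
                + "<dc:creator><rdf:Bag><rdf:li>A</rdf:li><rdf:li>B</rdf:li></rdf:Bag></dc:creator>"
                + "</rdf:Description></rdf:RDF></x:xmpmeta>";
        System.out.println(flatten(xmp));
    }
}
```

Keying on the namespace URI rather than the prefix means that documents using the older "xap" prefix and the newer "xmp" prefix for the same namespace end up under the same key.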
>> 
>> Regards,
>> 
>> Ray
>> 
>> 
>>> On Mar 8, 2016, at 12:50 PM, Allison, Timothy B. <tallison@mitre.org> wrote:
>>> 
>>> All,
>>> 
>>> PDFBox 2.0 is soon to be released.  In the course of its development, the
>>> project has migrated from Jempbox (which we're now using) to XmpBox, and
>>> Jempbox is now on its last legs.
>>> 
>>> XmpBox was "written for PDF/A checking," not for robust processing of common
>>> variants of XMPs in the wild; I found that it fails on roughly 40% of XMPs I
>>> pulled out of PDFs from govdocs1/commoncrawl.
>>> 
>>> In short, we can't migrate to XmpBox, and Jempbox is at the end of its life.
>>> 
>>> Has anyone had any luck with an Apache-friendly XMP parser?  Are there better
>>> options than copying and pasting jempbox into Tika and maintaining it
>>> ourselves (yuk!)?
>>> 
>>>        Best,
>>> 
>>>               Tim
>>> 
>>> -----Original Message-----
>>> From: Tilman Hausherr [mailto:THausherr@t-online.de]
>>> Sent: Tuesday, March 08, 2016 12:13 PM
>>> To: dev@pdfbox.apache.org
>>> Subject: Re: roadmap for XMPBox?
>>> 
>>> I think the problem is that XmpBox was written for PDF/A checking, so it
>>> fails with XMPs that are not PDF/A. For example, file 000142.pdf has the
>>> schema http://ns.adobe.com/pdfx/1.3/ which is not allowed for PDF/A:
>>> http://www.pdfa.org/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties_in_pdfa-1_2008-03-20.pdf
>>> 
>>> And no, there are no plans for anything on XMP at this time...
>>> 
>>> Tilman
>>> 
>>> 
>>> Am 07.03.2016 um 19:31 schrieb Allison, Timothy B.:
>>>> All,
>>>> 
>>>> 
>>>> 
>>>> When we migrate to PDFBox 2.x over on Tika, I'd much prefer to switch from
>>>> our current reliance on jempbox to XMPBox.  I recently extracted ~70k XMPs
>>>> from PDFs with PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser, there
>>>> were exceptions on roughly 40% of the XMPs.
>>>> 
>>>> 
>>>> 
>>>> I’m including a table below of the counts of exception messages.  Are there
>>>> any plans to make XMPBox more lenient, or is this what we can expect going
>>>> forward?
>>>> 
>>>> 
>>>> 
>>>> As always, I’m more than happy to help with files and tests.  Let me know
>>>> what I can do.
>>>> 
>>>> 
>>>> 
>>>>            Cheers,
>>>> 
>>>> 
>>>> 
>>>>                     Tim
>>>> 
>>>> 
>>>> 
>>>> No XmpParsingException on 42,022 files.
>>>> 
>>>> Exceptions:
>>>> 
>>>> Count  Message
>>>> 13403  Cannot find a definition for the namespace http://ns.adobe.com/pdfx/1.3/
>>>>  3710  Type 'originalDocumentID' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>>>>  3113  Missing pdfaSchema:property in type definition
>>>>  2867  Expecting namespace 'adobe:ns:meta/' and found 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
>>>>   927  Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
>>>>   723  Invalid array type, expecting Alt and found Seq [prefix=dc; name=description]
>>>>   710  Cannot find a definition for the namespace http://ns.adobe.com/xmp/InDesign/private
>>>>   654  Invalid array type, expecting Bag and found Seq [prefix=dc; name=subject]
>>>>   522  Cannot find a definition for the namespace http://ns.adobe.com/AcrobatAdhocWorkflow/1.0/
>>>>   492  Failed to parse
>>>>   370  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=date]
>>>>   262  Cannot find a definition for the namespace http://ns.adobe.com/illustrator/1.0/
>>>>   188  Cannot find a definition for the namespace http://ns.adobe.com/xfa/promoted-desc/
>>>>   144  Failed to instanciate property in xmp:CreateDate
>>>>   125  Schema is not set in this document : http://www.w3.org/1999/02/22-rdf-syntax-ns#
>>>>    94  Expecting local name 'xmpmeta' and found 'xapmeta'
>>>>    84  Cannot find a definition for the namespace http://www.rwjf.org/rwjf/1.0
>>>>    74  Failed to instanciate property in xap:CreateDate
>>>>    68  Invalid array definition, expecting Bag and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=language]
>>>>    49  Invalid array definition, expecting Alt and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=title]
>>>>    46  Cannot find a definition for the namespace http://www.sap.com
>>>>    33  Failed to instanciate property in exif:ColorSpace
>>>>    28  Failed to instanciate property in xmpMM:History
>>>>    26  xmp should start with a processing instruction
>>>>    24  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.0/
>>>>    21  Cannot find a definition for the namespace http://www.npes.org/pdfx/ns/id/
>>>>    14  Cannot find a definition for the namespace http://ns.InsiderSoftware.com/fontlist/1.0/
>>>>    14  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=creator]
>>>>    12  Failed to instanciate property in xmp:MetadataDate
>>>>    10  Cannot find a definition for the namespace http://ns.xinet.com/webnative/private/1.0/
>>>>    10  Failed to instanciate property in xap:ModifyDate
>>>>    10  Failed to instanciate property in xmp:ModifyDate
>>>>     9  Type 'params' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceEvent#
>>>>     8  Invalid array type, expecting Seq and found Bag [prefix=xmpMM; name=History]
>>>>     8  Type 'documentName' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>>>>     7  Cannot find a definition for the namespace http://www.day.com/dam/1.0
>>>>     7  Cannot find a definition for the namespace ptc
>>>>     6  Failed to instanciate property in xapMM:History
>>>>     5  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=tiff; name=YCbCrPositioning]
>>>>     5  Schema is not set in this document : http://purl.org/dc/elements/1.1/
>>>>     4  Cannot find a definition for the namespace http://www.extensis.com/meta/FontSense/
>>>>     4  Excepted xpacket 'end' attribute (must be present and placed in first)
>>>>     3  Invalid array type, expecting Seq and found Bag [prefix=photoshop; name=TextLayers]
>>>>     3  Schema is not set in this document : http://ns.adobe.com/xap/1.0/
>>>>     2  no message (NPE)
>>>>     2  Cannot find a definition for the namespace http://laserfiche.com/xmp/schema/1.0/
>>>>     2  Cannot find a definition for the namespace http://ns.adobe.com/AdobeFormsCentralWorkflow/1.0/
>>>>     2  Cannot find a definition for the namespace http://ns.adobe.com/camera-raw-settings/1.0/
>>>>     2  Failed to instanciate property in xapRights:Marked
>>>>     2  Invalid array type, expecting Alt and found Bag [prefix=dc; name=title]
>>>>     2  Invalid array type, expecting Alt and found Seq [prefix=dc; name=title]
>>>>     2  Invalid array type, expecting Seq and found Alt [prefix=dc; name=creator]
>>>>     1  Cannot find a definition for the namespace http://ns.cambridgeassociates.com/status/1.0/
>>>>     1  Cannot find a definition for the namespace http://ns.computershare.com.au/ccs/1.0/
>>>>     1  Cannot find a definition for the namespace http://ns.esko-graphics.com/grinfo/1.0/
>>>>     1  Cannot find a definition for the namespace http://ns.tripletriangle.com/ns/tripletri/
>>>>     1  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.1/
>>>>     1  Cannot find a definition for the namespace http://www.aiim.org/pdfa/ns/id.html
>>>>     1  Cannot find a definition for the namespace http://www.aiim.org/pdfe/ns/id/
>>>>     1  Cannot find a definition for the namespace http://www.enfocus.com/ns/CertifiedPDF/2.0/
>>>>     1  Cannot find a definition for the namespace http://www.northplains.com/xmpnps/cov/1.0/
>>>>     1  Failed to instanciate property in xmpRights:Marked
>>>>     1  Invalid array type, expecting Seq and found Bag [prefix=dc; name=date]
>>>>     1  This namespace is not a schema or a structured type : http://ns.adobe.com/xap/1.0/sType/Job#
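
[Editorial sketch] A tally like the one above can be produced by counting exception messages and sorting by frequency. The sketch below is only an illustration of that bookkeeping, not the code used for the original run; the class and method names are hypothetical. The figure 28,924 is the sum of the counts listed above (an editorial calculation, not a number stated in the thread); together with the 42,022 clean parses it gives 70,946 files, in line with the "~70k" and "roughly 40%" quoted in the message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the bookkeeping behind an exception-count table.
final class ExceptionTallySketch {

    // Counts each distinct message; result is ordered by descending count,
    // with ties broken alphabetically by message.
    static Map<String, Integer> tally(List<String> messages) {
        Map<String, Integer> counts = new HashMap<>();
        for (String message : messages) {
            counts.merge(message, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> sorted = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries) {
            sorted.put(e.getKey(), e.getValue());
        }
        return sorted;
    }

    // Fraction of files that failed, given counts of clean and failed parses.
    static double failureRate(long ok, long failed) {
        long total = ok + failed;
        return total == 0 ? 0.0 : (double) failed / total;
    }

    public static void main(String[] args) {
        // Illustrative input only; a real run would collect e.getMessage() per file.
        List<String> messages = Arrays.asList(
                "Failed to parse",
                "Missing pdfaSchema:property in type definition",
                "Failed to parse");
        tally(messages).forEach((msg, n) -> System.out.println(n + "\t" + msg));

        // 42,022 clean parses (from the message); 28,924 is the summed table.
        System.out.printf("failure rate: %.1f%%%n", 100 * failureRate(42022, 28924));
    }
}
```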
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
> 
