tika-dev mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject [DISCUSS] options for XMP parsing?
Date Tue, 08 Mar 2016 17:50:09 GMT
All,

  PDFBox 2.0 is soon to be released.  In the course of its development, the project has migrated
from Jempbox (which we're now using) to XmpBox; and Jempbox is now on its last legs.  
  
  XmpBox was "written for PDF/A checking," not for robust processing of common variants of
XMPs in the wild; I found that it fails on roughly 40% of XMPs I pulled out of PDFs from govdocs1/commoncrawl.
 
  In short, we can't migrate to XmpBox, and Jempbox is at the end of its life.
  
  Has anyone had any luck with an Apache-friendly XMP parser?  Are there better options than
copying and pasting Jempbox into Tika and maintaining it ourselves (yuk!)?

          Best,

                 Tim

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Tuesday, March 08, 2016 12:13 PM
To: dev@pdfbox.apache.org
Subject: Re: roadmap for XMPBox?

I think the problem is that XmpBox was written for PDF/A checking, so it fails with XMPs that
are not PDF/A. For example, file 000142.pdf has the schema http://ns.adobe.com/pdfx/1.3/ which
is not allowed for PDF/A:
http://www.pdfa.org/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties_in_pdfa-1_2008-03-20.pdf

And no, there are no plans for anything on XMP at this time...

Tilman


On 07.03.2016 at 19:31, Allison, Timothy B. wrote:
> All,
>
>    When we migrate to PDFBox 2.x over on Tika, I'd much prefer to switch from our current
> reliance on jempbox to XMPBox.  I recently extracted ~70k XMPs from PDFs with PDFBox 2.0.0-SNAPSHOT,
> and when I ran XMPBox's parser, there were exceptions on roughly 40% of the XMPs.
>
>    I’m including a table below of the counts of exception messages.  Are there any
> plans to make XMPBox more lenient or is this what we can expect going forward?
>
>    As always, I’m more than happy to help with files and tests.  Let me know what I
> can do.
>
>               Cheers,
>
>                        Tim
>
> No XmpParsingException on 42,022 files.
>
> Exceptions:
>
>  Count  Exception message
>  -----  -----------------
>  13403  Cannot find a definition for the namespace http://ns.adobe.com/pdfx/1.3/
>   3710  Type 'originalDocumentID' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>   3113  Missing pdfaSchema:property in type definition
>   2867  Expecting namespace 'adobe:ns:meta/' and found 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
>    927  Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
>    723  Invalid array type, expecting Alt and found Seq [prefix=dc; name=description]
>    710  Cannot find a definition for the namespace http://ns.adobe.com/xmp/InDesign/private
>    654  Invalid array type, expecting Bag and found Seq [prefix=dc; name=subject]
>    522  Cannot find a definition for the namespace http://ns.adobe.com/AcrobatAdhocWorkflow/1.0/
>    492  Failed to parse
>    370  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=date]
>    262  Cannot find a definition for the namespace http://ns.adobe.com/illustrator/1.0/
>    188  Cannot find a definition for the namespace http://ns.adobe.com/xfa/promoted-desc/
>    144  Failed to instanciate property in xmp:CreateDate
>    125  Schema is not set in this document : http://www.w3.org/1999/02/22-rdf-syntax-ns#
>     94  Expecting local name 'xmpmeta' and found 'xapmeta'
>     84  Cannot find a definition for the namespace http://www.rwjf.org/rwjf/1.0
>     74  Failed to instanciate property in xap:CreateDate
>     68  Invalid array definition, expecting Bag and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=language]
>     49  Invalid array definition, expecting Alt and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=title]
>     46  Cannot find a definition for the namespace http://www.sap.com
>     33  Failed to instanciate property in exif:ColorSpace
>     28  Failed to instanciate property in xmpMM:History
>     26  xmp should start with a processing instruction
>     24  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.0/
>     21  Cannot find a definition for the namespace http://www.npes.org/pdfx/ns/id/
>     14  Cannot find a definition for the namespace http://ns.InsiderSoftware.com/fontlist/1.0/
>     14  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=creator]
>     12  Failed to instanciate property in xmp:MetadataDate
>     10  Cannot find a definition for the namespace http://ns.xinet.com/webnative/private/1.0/
>     10  Failed to instanciate property in xap:ModifyDate
>     10  Failed to instanciate property in xmp:ModifyDate
>      9  Type 'params' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceEvent#
>      8  Invalid array type, expecting Seq and found Bag [prefix=xmpMM; name=History]
>      8  Type 'documentName' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>      7  Cannot find a definition for the namespace http://www.day.com/dam/1.0
>      7  Cannot find a definition for the namespace ptc
>      6  Failed to instanciate property in xapMM:History
>      5  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=tiff; name=YCbCrPositioning]
>      5  Schema is not set in this document : http://purl.org/dc/elements/1.1/
>      4  Cannot find a definition for the namespace http://www.extensis.com/meta/FontSense/
>      4  Excepted xpacket 'end' attribute (must be present and placed in first)
>      3  Invalid array type, expecting Seq and found Bag [prefix=photoshop; name=TextLayers]
>      3  Schema is not set in this document : http://ns.adobe.com/xap/1.0/
>      2  no message (NPE)
>      2  Cannot find a definition for the namespace http://laserfiche.com/xmp/schema/1.0/
>      2  Cannot find a definition for the namespace http://ns.adobe.com/AdobeFormsCentralWorkflow/1.0/
>      2  Cannot find a definition for the namespace http://ns.adobe.com/camera-raw-settings/1.0/
>      2  Failed to instanciate property in xapRights:Marked
>      2  Invalid array type, expecting Alt and found Bag [prefix=dc; name=title]
>      2  Invalid array type, expecting Alt and found Seq [prefix=dc; name=title]
>      2  Invalid array type, expecting Seq and found Alt [prefix=dc; name=creator]
>      1  Cannot find a definition for the namespace http://ns.cambridgeassociates.com/status/1.0/
>      1  Cannot find a definition for the namespace http://ns.computershare.com.au/ccs/1.0/
>      1  Cannot find a definition for the namespace http://ns.esko-graphics.com/grinfo/1.0/
>      1  Cannot find a definition for the namespace http://ns.tripletriangle.com/ns/tripletri/
>      1  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.1/
>      1  Cannot find a definition for the namespace http://www.aiim.org/pdfa/ns/id.html
>      1  Cannot find a definition for the namespace http://www.aiim.org/pdfe/ns/id/
>      1  Cannot find a definition for the namespace http://www.enfocus.com/ns/CertifiedPDF/2.0/
>      1  Cannot find a definition for the namespace http://www.northplains.com/xmpnps/cov/1.0/
>      1  Failed to instanciate property in xmpRights:Marked
>      1  Invalid array type, expecting Seq and found Bag [prefix=dc; name=date]
>      1  This namespace is not a schema or a structured type : http://ns.adobe.com/xap/1.0/sType/Job#
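
For anyone who wants to run a similar survey against another candidate parser, the per-message counts quoted above could be gathered with a harness roughly like the one below. This is only a sketch and not the code used for the numbers in this thread: the class name XmpExceptionTally, the XmpCheck interface, and the stand-in check in main are all made up for illustration. In a real run the check would presumably wrap the parser under test (for XMPBox, something like DomXmpParser.parse), and the packets would be read from the extracted XMP files.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class XmpExceptionTally {

    /** Bucket for packets that parsed without throwing. */
    static final String OK = "(no exception)";

    /** Hypothetical hook: wraps whichever XMP parser is being evaluated. */
    @FunctionalInterface
    interface XmpCheck {
        void parse(byte[] xmp) throws Exception;
    }

    /** Runs the check on every packet and counts outcomes by exception message. */
    static Map<String, Integer> tally(List<byte[]> packets, XmpCheck check) {
        Map<String, Integer> counts = new HashMap<>();
        for (byte[] packet : packets) {
            String key = OK;
            try {
                check.parse(packet);
            } catch (Exception e) {
                // Some exceptions (e.g. NPEs) carry no message; bucket those by class name.
                key = e.getMessage() != null
                        ? e.getMessage()
                        : "no message (" + e.getClass().getSimpleName() + ")";
            }
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the buckets sorted by count, largest first. */
    static List<Map.Entry<String, Integer>> byCountDescending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(counts.entrySet());
        rows.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return rows;
    }

    public static void main(String[] args) {
        // Stand-in check (an assumption for the demo): rejects packets that
        // do not begin with an xpacket processing instruction.
        XmpCheck standIn = xmp -> {
            String s = new String(xmp, StandardCharsets.UTF_8);
            if (!s.startsWith("<?xpacket")) {
                throw new IllegalArgumentException("xmp should start with a processing instruction");
            }
        };
        List<byte[]> packets = Arrays.asList(
                "<?xpacket begin=\"\"?><x:xmpmeta/>".getBytes(StandardCharsets.UTF_8),
                "<x:xmpmeta/>".getBytes(StandardCharsets.UTF_8),
                "<rdf:RDF/>".getBytes(StandardCharsets.UTF_8));
        for (Map.Entry<String, Integer> row : byCountDescending(tally(packets, standIn))) {
            System.out.printf("%6d  %s%n", row.getValue(), row.getKey());
        }
    }
}
```

Keying on the raw exception message keeps the harness parser-agnostic, at the cost of splitting one root cause across several buckets when the message embeds a namespace or property name, as several rows in the table do.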


