Thank you! Will take a look at the SO link, and I'll see if I can dig up any of these in our
regression testing corpus.
-----Original Message-----
From: Ray Gauss [mailto:ray.gauss@alfresco.com]
Sent: Monday, March 14, 2016 1:06 PM
To: dev@tika.apache.org
Subject: Re: [DISCUSS] options for XMP parsing?
Hi Tim,
Consolidated handling of XMP would be great. I'm glad you're taking a look at it, and I'll try
to help out where I can.
> You've been happy with it at Alfresco?
It's been a while since I looked at it but I don't recall any difficulties.
> I'd be interested to hear more about what happens with InDesign files.
It stores things in 'pages' [1].
Regards,
Ray
[1] http://stackoverflow.com/a/22661992
> On Mar 10, 2016, at 9:38 AM, Allison, Timothy B. <tallison@mitre.org> wrote:
>
> Hi Ray,
> Got it. Thank you.
>
> That'd be great. In follow-up discussion with PDFBox devs, they mentioned that it is
> not a design feature/restriction of XMPBox that it doesn't handle non-PDF/A files...only a
> matter of patching and building out their current code base. The downside is there's quite
> a bit to do; the upside is that it is a living code base.
>
> I'll experiment with Adobe's xmp-core. If you have any pointers/examples, let me know...I'll
> be starting with: https://indisnip.wordpress.com/2010/08/17/extract-metadata-with-adobe-xmp-part-2/.
> You've been happy with it at Alfresco?
>
> No matter which package we use, it would be nice to build out uniform extraction of XMP
> for all image and PDF files for the common elements -- with special handling by file type
> if necessary. As you mentioned, it would also be great to add or modify our XMPScanner to
> extract all XMP packets from a file...I've started dabbling with this here:
> https://github.com/tballison/tika/tree/xmp_scanner. I'd be interested to hear more about
> what happens with InDesign files. In our own test set, we have a PDF file with two packets
> containing conflicting authorship info IIRC! :) It would be nice to expose both the canonical
> XMP info (with proper processing of "later-xmp-overrides-earlier") as well as all of the info
> that can be scraped from the XMP (packet1: authorXYZ packet2: authorQRS)...two different
> use cases.
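
The two use cases described above can be sketched as follows. This is only an illustration in Python, not Tika code: the names `creators` and `summarize_creators` are hypothetical, the packets are assumed to be already extracted as UTF-8 bytes, and "later overrides earlier" is assumed to mean that the last packet with a non-empty dc:creator wins.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Fully qualified (Clark notation) names for dc:creator and rdf:li.
DC_CREATOR = "{http://purl.org/dc/elements/1.1/}creator"
RDF_LI = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}li"


def creators(packet: bytes) -> List[str]:
    """Return the dc:creator values found in one XMP packet."""
    root = ET.fromstring(packet)
    return [
        li.text.strip()
        for prop in root.iter(DC_CREATOR)
        for li in prop.iter(RDF_LI)
        if li.text and li.text.strip()
    ]


def summarize_creators(packets: List[bytes]) -> Dict[str, object]:
    """Expose both views: the canonical value and everything that was scraped.

    Assumption: the last packet (in file order) with a non-empty creator list
    is treated as canonical.
    """
    per_packet = [creators(p) for p in packets]
    canonical = next((c for c in reversed(per_packet) if c), [])
    return {"canonical": canonical, "all": per_packet}
```

For the PDF mentioned above, with authorXYZ in the first packet and authorQRS in the second, the "canonical" entry would hold authorQRS while "all" would keep both values, one list per packet.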
>
> Thank you, again.
>
> Cheers,
>
> Tim
>
>
>
>
> -----Original Message-----
> From: Ray Gauss [mailto:ray.gauss@alfresco.com]
> Sent: Tuesday, March 08, 2016 2:34 PM
> To: dev@tika.apache.org
> Subject: Re: [DISCUSS] options for XMP parsing?
>
> To clarify... the 'we' in my third sentence was referring to Alfresco, not Tika.
>
> I'm not sure how much of that code would be useful, but I may be able to contribute some
> of it.
>
> Regards,
>
> Ray
>
>
>> On Mar 8, 2016, at 2:07 PM, Allison, Timothy B. <tallison@mitre.org> wrote:
>>
>> Thank you. Will take a look.
>>
>> -----Original Message-----
>> From: Ray Gauss [mailto:ray.gauss@alfresco.com]
>> Sent: Tuesday, March 08, 2016 1:55 PM
>> To: dev@tika.apache.org
>> Subject: Re: [DISCUSS] options for XMP parsing?
>>
>> Hi Tim,
>>
>> We're already using Adobe's xmpcore in tika-xmp, which works fine for parsing XMP
>> (though it has not seen updates in a while), but getting the XMP packets out of the files is
>> trickier.
>>
>> We have XMPPacketScanner, which works for many cases, but not all. InDesign files,
>> for example, do some strange things.
>>
>> In the past we've used different packet scanners depending on the file type (including
>> Exiftool command-line) to get the XMP out, then used xmpcore to parse it into simple flattened
>> properties.
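
To make the "get the XMP out" step concrete, here is a minimal byte-level scan for packet wrappers. It is a sketch in Python under stated assumptions, not the algorithm used by XMPPacketScanner or Exiftool: the function name `scan_xmp_packets` is hypothetical, packets are assumed to be stored contiguously in a UTF-8-compatible encoding, and formats that split or relocate the packet (as InDesign reportedly does) are not handled.

```python
import re
from typing import List

# One XMP packet: from the opening <?xpacket begin=...?> processing
# instruction through the closing <?xpacket end=...?> instruction.
_PACKET = re.compile(
    rb"<\?xpacket\s+begin=.*?\?>"   # opening processing instruction
    rb".*?"                          # packet body
    rb"<\?xpacket\s+end=.*?\?>",    # closing processing instruction
    re.DOTALL,
)


def scan_xmp_packets(data: bytes) -> List[bytes]:
    """Return every XMP packet found in the raw file bytes, in file order."""
    return [match.group(0) for match in _PACKET.finditer(data)]
```

Each returned packet could then be handed to an XMP parser and flattened into simple properties; a per-file-type scanner would replace this function where the generic scan falls short.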
>>
>> Regards,
>>
>> Ray
>>
>>
>>> On Mar 8, 2016, at 12:50 PM, Allison, Timothy B. <tallison@mitre.org> wrote:
>>>
>>> All,
>>>
>>> PDFBox 2.0 is soon to be released. In the course of its development, the project
>>> has migrated from Jempbox (which we're now using) to XmpBox, and Jempbox is now on its last
>>> legs.
>>>
>>> XmpBox was "written for PDF/A checking," not for robust processing of common
>>> variants of XMPs in the wild; I found that it fails on roughly 40% of XMPs I pulled out of
>>> PDFs from govdocs1/commoncrawl.
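
As an illustration of what "robust processing of common variants" could mean, the sketch below reads a property's values whether they arrive in an rdf:Bag, rdf:Seq, or rdf:Alt, or as bare text, which are the variants behind several of the array-type exceptions listed further down in this thread. It is Python rather than Java, the name `lenient_values` is hypothetical, and it does not reflect how Jempbox, XmpBox, or xmpcore behave internally.

```python
import xml.etree.ElementTree as ET
from typing import List

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_LI = "{%s}li" % RDF
# Any of the three RDF container types is accepted, whatever the schema expects.
_CONTAINERS = {"{%s}%s" % (RDF, name) for name in ("Bag", "Seq", "Alt")}


def lenient_values(prop: ET.Element) -> List[str]:
    """Return the values of one XMP property element, ignoring container type."""
    values = []
    for child in prop:
        if child.tag in _CONTAINERS:
            for li in child.findall(RDF_LI):
                text = (li.text or "").strip()
                if text:
                    values.append(text)
    if values:
        return values
    # Fall back to bare text content, e.g. <dc:date>2016-03-08</dc:date>.
    text = (prop.text or "").strip()
    return [text] if text else []
```

The trade-off is deliberate: a validator should reject a Bag where a Seq is required, while a metadata extractor usually just wants the values.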
>>>
>>> In short, we can't migrate to XmpBox, and Jempbox is at the end of its life.
>>>
>>> Has anyone had any luck with an Apache-friendly XMP parser? Are there better
>>> options than copying and pasting jempbox into Tika and maintaining it ourselves (yuk!)?
>>>
>>> Best,
>>>
>>> Tim
>>>
>>> -----Original Message-----
>>> From: Tilman Hausherr [mailto:THausherr@t-online.de]
>>> Sent: Tuesday, March 08, 2016 12:13 PM
>>> To: dev@pdfbox.apache.org
>>> Subject: Re: roadmap for XMPBox?
>>>
>>> I think the problem is that XmpBox was written for PDF/A checking, so it fails
>>> with XMPs that are not PDF/A. For example, file 000142.pdf has the schema http://ns.adobe.com/pdfx/1.3/
>>> which is not allowed for PDF/A:
>>> http://www.pdfa.org/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties_in_pdfa-1_2008-03-20.pdf
>>>
>>> And no, there are no plans for anything on XMP at this time...
>>>
>>> Tilman
>>>
>>>
>>> On 07.03.2016 at 19:31, Allison, Timothy B. wrote:
>>>> All,
>>>>
>>>>
>>>>
>>>> When we migrate to PDFBox 2.x over on Tika, I'd much prefer to switch from
>>>> our current reliance on jempbox to XMPBox. I recently extracted ~70k XMPs from PDFs with
>>>> PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser, there were exceptions on roughly 40%
>>>> of the XMPs.
>>>>
>>>>
>>>>
>>>> I’m including a table below of the counts of exception messages. Are there
>>>> any plans to make XMPBox more lenient, or is this what we can expect going forward?
>>>>
>>>>
>>>>
>>>> As always, I’m more than happy to help with files and tests. Let me know
>>>> what I can do.
>>>>
>>>>
>>>>
>>>> Cheers,
>>>>
>>>>
>>>>
>>>> Tim
>>>>
>>>>
>>>>
>>>> No XmpParsingException on 42,022 files.
>>>>
>>>> Exceptions:
>>>>
>>>> Count  Message
>>>> -----  -------
>>>> 13403  Cannot find a definition for the namespace http://ns.adobe.com/pdfx/1.3/
>>>>  3710  Type 'originalDocumentID' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>>>>  3113  Missing pdfaSchema:property in type definition
>>>>  2867  Expecting namespace 'adobe:ns:meta/' and found 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
>>>>   927  Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
>>>>   723  Invalid array type, expecting Alt and found Seq [prefix=dc; name=description]
>>>>   710  Cannot find a definition for the namespace http://ns.adobe.com/xmp/InDesign/private
>>>>   654  Invalid array type, expecting Bag and found Seq [prefix=dc; name=subject]
>>>>   522  Cannot find a definition for the namespace http://ns.adobe.com/AcrobatAdhocWorkflow/1.0/
>>>>   492  Failed to parse
>>>>   370  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=date]
>>>>   262  Cannot find a definition for the namespace http://ns.adobe.com/illustrator/1.0/
>>>>   188  Cannot find a definition for the namespace http://ns.adobe.com/xfa/promoted-desc/
>>>>   144  Failed to instanciate property in xmp:CreateDate
>>>>   125  Schema is not set in this document : http://www.w3.org/1999/02/22-rdf-syntax-ns#
>>>>    94  Expecting local name 'xmpmeta' and found 'xapmeta'
>>>>    84  Cannot find a definition for the namespace http://www.rwjf.org/rwjf/1.0
>>>>    74  Failed to instanciate property in xap:CreateDate
>>>>    68  Invalid array definition, expecting Bag and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=language]
>>>>    49  Invalid array definition, expecting Alt and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=title]
>>>>    46  Cannot find a definition for the namespace http://www.sap.com
>>>>    33  Failed to instanciate property in exif:ColorSpace
>>>>    28  Failed to instanciate property in xmpMM:History
>>>>    26  xmp should start with a processing instruction
>>>>    24  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.0/
>>>>    21  Cannot find a definition for the namespace http://www.npes.org/pdfx/ns/id/
>>>>    14  Cannot find a definition for the namespace http://ns.InsiderSoftware.com/fontlist/1.0/
>>>>    14  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=creator]
>>>>    12  Failed to instanciate property in xmp:MetadataDate
>>>>    10  Cannot find a definition for the namespace http://ns.xinet.com/webnative/private/1.0/
>>>>    10  Failed to instanciate property in xap:ModifyDate
>>>>    10  Failed to instanciate property in xmp:ModifyDate
>>>>     9  Type 'params' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceEvent#
>>>>     8  Invalid array type, expecting Seq and found Bag [prefix=xmpMM; name=History]
>>>>     8  Type 'documentName' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>>>>     7  Cannot find a definition for the namespace http://www.day.com/dam/1.0
>>>>     7  Cannot find a definition for the namespace ptc
>>>>     6  Failed to instanciate property in xapMM:History
>>>>     5  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=tiff; name=YCbCrPositioning]
>>>>     5  Schema is not set in this document : http://purl.org/dc/elements/1.1/
>>>>     4  Cannot find a definition for the namespace http://www.extensis.com/meta/FontSense/
>>>>     4  Excepted xpacket 'end' attribute (must be present and placed in first)
>>>>     3  Invalid array type, expecting Seq and found Bag [prefix=photoshop; name=TextLayers]
>>>>     3  Schema is not set in this document : http://ns.adobe.com/xap/1.0/
>>>>     2  no message (NPE)
>>>>     2  Cannot find a definition for the namespace http://laserfiche.com/xmp/schema/1.0/
>>>>     2  Cannot find a definition for the namespace http://ns.adobe.com/AdobeFormsCentralWorkflow/1.0/
>>>>     2  Cannot find a definition for the namespace http://ns.adobe.com/camera-raw-settings/1.0/
>>>>     2  Failed to instanciate property in xapRights:Marked
>>>>     2  Invalid array type, expecting Alt and found Bag [prefix=dc; name=title]
>>>>     2  Invalid array type, expecting Alt and found Seq [prefix=dc; name=title]
>>>>     2  Invalid array type, expecting Seq and found Alt [prefix=dc; name=creator]
>>>>     1  Cannot find a definition for the namespace http://ns.cambridgeassociates.com/status/1.0/
>>>>     1  Cannot find a definition for the namespace http://ns.computershare.com.au/ccs/1.0/
>>>>     1  Cannot find a definition for the namespace http://ns.esko-graphics.com/grinfo/1.0/
>>>>     1  Cannot find a definition for the namespace http://ns.tripletriangle.com/ns/tripletri/
>>>>     1  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.1/
>>>>     1  Cannot find a definition for the namespace http://www.aiim.org/pdfa/ns/id.html
>>>>     1  Cannot find a definition for the namespace http://www.aiim.org/pdfe/ns/id/
>>>>     1  Cannot find a definition for the namespace http://www.enfocus.com/ns/CertifiedPDF/2.0/
>>>>     1  Cannot find a definition for the namespace http://www.northplains.com/xmpnps/cov/1.0/
>>>>     1  Failed to instanciate property in xmpRights:Marked
>>>>     1  Invalid array type, expecting Seq and found Bag [prefix=dc; name=date]
>>>>     1  This namespace is not a schema or a structured type : http://ns.adobe.com/xap/1.0/sType/Job#
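
Counts like the ones above can be produced by tallying one outcome per file. The sketch below is an assumption about the bookkeeping, not the script that generated these numbers: `tally_failures` and `failure_rate` are hypothetical names, and each parse is assumed to have been reduced beforehand to either `None` (no exception) or the exception message string.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def tally_failures(outcomes: Iterable[Optional[str]]) -> Tuple[int, List[Tuple[str, int]]]:
    """Count clean parses and group failures by exception message.

    Returns (number of clean parses, [(message, count), ...]) with the most
    frequent message first.
    """
    clean = 0
    by_message = Counter()
    for message in outcomes:
        if message is None:
            clean += 1
        else:
            by_message[message] += 1
    return clean, by_message.most_common()


def failure_rate(clean: int, rows: List[Tuple[str, int]]) -> float:
    """Fraction of files whose parse raised an exception."""
    failed = sum(count for _, count in rows)
    total = clean + failed
    return failed / total if total else 0.0
```

Grouping on the raw message keeps the table easy to read, though messages that embed file-specific values would need normalizing first.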
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>