tika-dev mailing list archives

From Jonathan Koren <jonat...@soe.ucsc.edu>
Subject Re: Using standard XMP schemas for image and audio metadata
Date Sun, 08 Feb 2009 05:22:54 GMT

On Feb 7, 2009, at 11:32 AM, Jukka Zitting wrote:
> The current image and audio parsers use hardcoded strings like
> "width", "height", "encoding" and "samplerate" for extracted metadata.
> The semantics of these metadata keys are nowhere documented and little
> thought has been put on interoperability with external metadata
> applications. To improve things I'd like to replace these custom
> metadata keys with keys defined in part 2 of the XMP specification
> [1].
> More specifically, I'd like to start using the following keys for
> image and audio metadata:
>    * "tiff:ImageWidth" instead of "width"
>    * "tiff:ImageHeight" instead of "height"
>    * "xmpDM:audioCompressor" instead of "encoding"
>    * "xmpDM:audioSampleRate" instead of "samplerate"
>    * "xmpDM:audioSampleType" instead of "bits"
>    * "xmpDM:audioChannelType" instead of "channels"

Why would you want to use a tag that implies that the underlying data  
is TIFF when it isn't (e.g. JPEG)?  That strikes me as a REALLY Bad  
Idea(tm).  The reason why Adobe put this out and is using TIFF tags is  
because they target Photoshop to professional photographers that take  
12 megapixel shots and store them as uncompressed TIFFs.  It's the  
path of least resistance for them, since they already support TIFF  
tags.  Correctness isn't even fourth on their list of priorities.  If  
this was from Apple, they'd be talking about iPhoto, and so you would  
have gotten jpg:wdth, because the average consumer takes JPEGs.  This  
isn't even really a spec as much as it's Adobe saying, "This is what  
we're already doing and we're not changing.  If you want to play,  
these are the rules.  Deal with it."  While appropriate for  
interoperability with Adobe Creative Suite, this isn't really for general  

The problem with all these metadata standards is that they're all dumb  
in the sense that they duplicate effort.  What is the  
philosophical difference between xmpDM:artist,
tiff:Artist, and dc:creator?  These examples were culled from Adobe's  
XMP "spec" you linked to.  Throw in id3:artist, pdf:author,  and  
literally countless others, and you can begin to appreciate the sheer  
number of metadata tags that mean "person or organization from which  
this artifact originates."[*]

You're already converting metadata from one ontology to another,  
whether you realize it or not, each one of which has its own biases  
and shortcomings.  Currently you're converting from whatever metadata  
ontology the file has to Tika's implicit ontology.  I consider this a  
Good Thing(tm).  As a developer I shouldn't have to know what esoteric  
keys are used to store what metadata in whatever specific file I'm  
reading, no more than I have to know how to get the text out of the  
file.  Tika handles that for me, and that's why I like it.  It's  
someone else's problem.

Metadata ontologies are already such a mess, because of historical,  
not-invented-here, and I-know-better-than-everyone-else reasons.   
Fundamentally, they're just key-value pairs, so who cares?  Just wrap  
whatever key-value pairs are detected with some namespace thing  
to avoid name collisions, and copy the metadata to some generic Tika  
ontology.  That way the user has a common interface to whatever  
metadata he/she wants, but at the same time has access to the raw  
metadata if need be.  Even if you ended up duplicating all the  
metadata, we're dealing with what?  20 keys?  It's trivial.
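
The wrap-and-copy idea above could look roughly like the following.
This is a minimal sketch in Python, not Tika's actual API: the function
name `normalize`, the `GENERIC_KEYS` table, and the generic key names
"creator" and "width" are all hypothetical, chosen only to illustrate
keeping the raw keys under a namespace while also exposing a common key.

```python
# Hypothetical table mapping namespaced raw keys to generic keys.
# The generic names ("creator", "width") are assumptions for illustration.
GENERIC_KEYS = {
    "tiff:Artist": "creator",
    "dc:creator": "creator",
    "xmpDM:artist": "creator",
    "id3:artist": "creator",
    "pdf:author": "creator",
    "tiff:ImageWidth": "width",
}


def normalize(namespace, raw):
    """Return the raw metadata under namespaced keys, plus generic copies."""
    result = {}
    for key, value in raw.items():
        namespaced = f"{namespace}:{key}"
        # The raw value stays available under its namespaced key.
        result[namespaced] = value
        generic = GENERIC_KEYS.get(namespaced)
        if generic is not None:
            # Duplicate the value under the generic key; first one seen wins.
            result.setdefault(generic, value)
    return result
```

Under these assumptions, `normalize("id3", {"artist": "Some Band"})`
yields both an "id3:artist" entry and a "creator" entry with the same
value, so the caller can use whichever one suits.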

Sympathizing with the Universalist camp, I say there's no reason why  
you can't combine metadata from a variety of ontologies, and then have  
the values interpreted appropriately according to whatever document  
type the user is interrogating.  Say we're dealing with the concept of  
"length".  This represents a variety of concepts, but typically either  
a spatial or temporal measurement.  No one is going to interpret the  
"length" tag for an audio file as being meters, and if they do,  
they're dumb.
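
As a rough illustration of interpreting one generic key according to
document type, here is a small Python sketch.  The `LENGTH_UNITS` table,
the `describe_length` function, and the choice of units per media type
are assumptions made up for this example, not anything Tika defines.

```python
# Hypothetical units for a generic "length" key, keyed by the major
# part of the media type.  The unit choices are illustrative only.
LENGTH_UNITS = {
    "audio": "seconds",
    "video": "seconds",
    "image": "pixels",
}


def describe_length(media_type, value):
    """Render a generic "length" value with a unit suited to the media type."""
    major = media_type.split("/", 1)[0]
    unit = LENGTH_UNITS.get(major)
    if unit is None:
        # Unknown document type: return the bare value rather than guess.
        return str(value)
    return f"{value} {unit}"
```

With this table, `describe_length("audio/mpeg", 215)` reads the value as
a duration, while an unrecognized type falls back to the bare value.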

In summary, my objections are:

1. XMP is lazily written.
2. XMP was never intended to solve the problem at hand.
3. There needs to be a clean interface.  A hodgepodge of competing or  
at best quasi-interoperable standards isn't clean.
4. They're just key-value pairs.  It doesn't cost anything to add  
more, so just add everything.

[*] I can't help but think that this touches on the Problem of  
Universals, which has been around for about 2400 years.

Jonathan Koren
