tika-dev mailing list archives

From Jonathan Koren <jonat...@soe.ucsc.edu>
Subject Re: Using standard XMP schemas for image and audio metadata
Date Sun, 08 Feb 2009 15:57:48 GMT

On Feb 8, 2009, at 5:55 AM, Jukka Zitting wrote:

> Hi,
>
> On Sun, Feb 8, 2009 at 6:22 AM, Jonathan Koren  
> <jonathan@soe.ucsc.edu> wrote:
>> The problem with all these metadata standards is that they're all  
>> dumb in
>> the sense that they duplicate effort.
>
> Agreed. So why would we want to duplicate the effort in Tika?

Because someone is going to be stuck doing it anyway.  The only  
question is whether it's going to be Tika, or the application using  
Tika.  Tika is in a better position to know what the variety of  
formats are and how they interrelate, better than any single  
application developer.  Tika already does this with respect to the  
barebones metadata of image and audio files.  Picking some externally  
developed standard doesn't solve anything.  All it does is purport to  
absolve Tika of responsibility.

Say you want to export (because that's what we're really talking about  
here) Dublin Core.  MS Office doesn't support DC; it has its own  
ontology.  Not only do these ontologies not map one to one, they only  
sort of share one concept: ms:author and dc:creator.  The other  
concepts simply don't exist.  Sure you could perhaps cajole  
ms:lastauthor into dc:contributor, or ms:lastsavedate to dc:modified,  
but the vast majority of items simply have no counterpart in the other  
ontology.   Now whatever DC Tika would construct from the MS metadata  
would be wrong by definition (since the ontologies are being abused)  
or be so devoid of information, it might as well not even exist.

Now let's say we're dealing with two other metadata formats.  You've  
got ID3v2 and you want to export out XMP.  XMP has xmpDM:artist, but  
your ID3 information has conflicting id3:artist and id3:albumartist  
tags.  Which one do you map, and which one do you lose?  More  
importantly, how do you tell the user that you might have mapped the  
wrong one?  If you use a Tika namespace for the lowest common  
denominator metadata, you have not only provided an answer to the  
question "who's the artist?", but you've also told the user that the  
answer might be wrong.  This ability to express uncertainty simply  
doesn't exist in any existing ontology because each ontology believes  
it's the One True Ontology, and that mappings from the inferior  
ontologies to the One True Ontology exist for at least all cases that  
anyone cares about.
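
A minimal sketch of answering "who's the artist?" while also reporting
that the answer might be wrong, in Python for brevity. The `Answer`
type and the `resolve_artist` helper are hypothetical names made up for
illustration; nothing here is existing Tika API, and preferring
id3:artist over id3:albumartist is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Answer:
    """A lowest-common-denominator value plus a note on how far to trust it."""
    value: Optional[str]          # the single value we chose to report
    uncertain: bool               # True when the sources disagreed
    candidates: Tuple[str, ...]   # every distinct value we saw


def resolve_artist(id3: Dict[str, str]) -> Answer:
    """Pick one artist from raw ID3 keys and flag conflicts (hypothetical)."""
    candidates = []
    # Assumed preference order; a real mapping might choose differently.
    for key in ("id3:artist", "id3:albumartist"):
        value = id3.get(key)
        if value and value not in candidates:
            candidates.append(value)
    if not candidates:
        return Answer(None, False, ())
    # More than one distinct candidate means the chosen value may be wrong.
    return Answer(candidates[0], len(candidates) > 1, tuple(candidates))
```

Calling `resolve_artist` with conflicting id3:artist and
id3:albumartist values still returns one value, but with `uncertain`
set and both candidates kept, so the application can decide for itself.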

I STRONGLY believe that you're going to have to store all the raw  
metadata according to some set of Tika blessed namespaces (e.g. dc,  
id3, xmp, msoffice, exif, tiff, etc) in order to allow application  
developers to handle anything above the least common denominator of  
the various metadata formats.  No mapping among the ontologies exists  
that is going to satisfy everyone in all cases, so why should Tika  
keep users from making their own mappings if they really want to do  
that?  If you use an existing ontology, you're going to have to flag  
that it's synthesized from other metadata, and thus is suspect.   
Furthermore you're going to have to be able to flag the synthesized data  
on a per key basis in order to avoid collisions between real and  
synthetic metadata within the exported namespace.
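
A minimal sketch of per-key flagging, in Python for brevity. The
`MetadataStore` and `Entry` names and their methods are hypothetical,
made up for illustration only, and are not existing Tika API. The
assumption is that values read directly from the file always win, and a
synthesized value never overwrites a real one in the same namespace.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Entry:
    value: str
    synthesized: bool = False   # True when derived from another namespace


@dataclass
class MetadataStore:
    # Keyed by (namespace, key), e.g. ("msoffice", "lastauthor").
    entries: Dict[Tuple[str, str], Entry] = field(default_factory=dict)

    def put_raw(self, namespace: str, key: str, value: str) -> None:
        """Store a value read directly from the file; it replaces anything."""
        self.entries[(namespace, key)] = Entry(value, synthesized=False)

    def put_synthesized(self, namespace: str, key: str, value: str) -> bool:
        """Store a mapped value unless a real one already holds that key."""
        existing = self.entries.get((namespace, key))
        if existing is None or existing.synthesized:
            self.entries[(namespace, key)] = Entry(value, synthesized=True)
            return True
        return False

    def get(self, namespace: str, key: str) -> Optional[Entry]:
        return self.entries.get((namespace, key))


store = MetadataStore()
store.put_raw("msoffice", "lastauthor", "Bob")
# Cajoling ms:lastauthor into dc:contributor is marked as suspect.
store.put_synthesized("dc", "contributor", "Bob")
```

An application that wants only real metadata can skip every entry with
`synthesized` set; one that wants its own mapping can ignore the dc
entries and read the raw msoffice keys directly.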

--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/


