tika-dev mailing list archives

From Jonathan Koren <jonat...@soe.ucsc.edu>
Subject Re: Using standard XMP schemas for image and audio metadata
Date Mon, 09 Feb 2009 14:11:14 GMT

On Feb 8, 2009, at 10:59 AM, Jukka Zitting wrote:

> Hi,
> On Sun, Feb 8, 2009 at 4:57 PM, Jonathan Koren  
> <jonathan@soe.ucsc.edu> wrote:
>> On Feb 8, 2009, at 5:55 AM, Jukka Zitting wrote:
>>> On Sun, Feb 8, 2009 at 6:22 AM, Jonathan Koren <jonathan@soe.ucsc.edu>
>>> wrote:
>>>> The problem with all these metadata standards is that they're all
>>>> dumb in the sense that they duplicate effort.
>>> Agreed. So why would we want to duplicate the effort in Tika?
>> Because someone is going to be stuck doing it anyway.
> Why? The metadata keys I proposed are semantically equivalent to the
> custom keys we use now. Why would someone need to specify custom keys
> when standard alternatives for the exact same concepts already exist?
> Note that I'm only proposing that we change the keys of the six
> metadata entries I listed.

But why only those six? It certainly seems like an arbitrary list
based on temporary convenience. You're not proposing to support all
of XMP, just the bare minimum that you need this week. At some point
you're going to want to add more metadata, and then you're going
to have to deal with the ontology mismatch problem. By luck or design
you've picked ones that do map 1-to-1 to some other ontology, but this
doesn't hold across XMP and it doesn't scale across multiple
ontologies, including the ontologies you're currently using. You
haven't explained how you're going to solve the mismatch problem when
the day comes that you want to add more metadata.

I don't understand what you do with the things that don't map 1-to-1
with XMP. Ignore them? That doesn't work because then you're
arbitrarily dictating what kinds of problems the user can solve. Map
them to some other space? That doesn't work either because then if
the user wants to grab all the metadata from the foo space the user
will have to know that foo:one gets mapped to bar:uno, foo:two gets
mapped to baz:cinco, and foo:three doesn't get mapped. It's
unreasonable to force such an ugly hack on all users just because it
was easier to do this for one person once.

> I have a concrete use case where doing this would be beneficial: My
> employer is building a digital asset management application where we
> plan to leverage XMP for metadata handling. Rather than explicitly
> mapping each individual Tika metadata key to equivalent XMP entries,
> it would be much easier and clearer to just map the "tiff" and "xmlDM"
> prefixes to appropriate XMP namespaces when importing Tika metadata.
> We also wouldn't need to keep updating the metadata mappings whenever
> new Tika versions start supporting new keys.

I understand that you don't want to keep updating your own code every
time Tika changes, but as you said, this is a 0.x release, so you're
going to be stuck doing that for a while. What I don't understand is
why a publicly available library is the appropriate place to naively
hardcode the requirements of your current project.

> Is there some better way for us to implement this use case?

Yes. Tika does no translation between ontologies. It simply dumps
all metadata detected for a file into its own namespace. This means
that an MS Office file gets an MS namespace. Something with XMP gets
an XMP namespace. ID3 tags go into the ID3 namespace. Tika does no
mapping among the types by default. You create a new class that takes
the raw key-value pairs that are stored in Tika::Metadata and translates
them to something else. Call it Metadata2XMP or whatever. That can
be packaged within Tika as a convenience class that does
least-common-denominator (LCD) mapping in a well-defined way. By breaking
the mapping out to a class separate from Metadata, you avoid spreading a
single metadata namespace across 15 namespaces, and you make all mapping
100% reversible (well, in this case ignorable), since inevitably some
mappings will be wrong in some case. If all a user wants is LCD metadata,
they can get it through a common XMP namespace.
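
A minimal sketch of such a separate mapper, under loud assumptions: a plain Map stands in for the raw key-value pairs held by Tika's Metadata object, and the class name, method name, and every key in the table below are hypothetical, made up only to illustrate the shape of the idea. It is not existing Tika API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical mapper (not part of Tika): reads raw, format-native
 * metadata and produces a separate XMP-keyed view of it.
 */
final class Metadata2Xmp {

    // Illustrative least-common-denominator table. The key names on both
    // sides are assumptions for this sketch, not real Tika or XMP constants.
    private static final Map<String, String> LCD = Map.of(
            "id3:title", "dc:title",
            "msoffice:Author", "dc:creator",
            "exif:ImageWidth", "tiff:ImageWidth");

    private Metadata2Xmp() {
    }

    /** Returns a new map of XMP-style keys; the raw map is never modified. */
    static Map<String, String> translate(Map<String, String> raw) {
        Map<String, String> xmp = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String target = LCD.get(entry.getKey());
            if (target != null) {
                xmp.put(target, entry.getValue());
            }
            // Keys with no 1-to-1 equivalent are left out of the XMP view,
            // but they are still available, untouched, in the raw map.
        }
        return xmp;
    }
}
```

Because the translation produces a new map and leaves the raw one alone, a caller who disagrees with a mapping can ignore the XMP view and read the native keys directly, while a caller who only wants the common subset reads the XMP view and nothing else.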

Jonathan Koren
