tika-dev mailing list archives

From Ray Gauss II <ray.ga...@alfresco.com>
Subject Re: Metadata situation and XMP support in Tika
Date Tue, 24 Apr 2012 13:10:19 GMT
I think the aliasing approach supports both use cases nicely, i.e.:

Metadata.java:
...
   Property TITLE = DublinCore.DC_TITLE;
...

Users then only have to concern themselves with "give me the metadata that best fits the idea
of Title, as defined by Tika". They don't even have to know about DublinCore, but can dig into
the details of the implementation as needed.

This separation is less of a concern in the particular case of DublinCore since it is such
a basic, broad, and widely accepted standard, but for other standards direct inclusion
in the Metadata interface makes less sense.  For example, at the moment we're essentially
asking users to say "give me the metadata that best fits the idea of Keywords, as defined
by MSOffice" which doesn't make a lot of sense when dealing with something like images.  If
we aliased:

Metadata.java:
...
   Property KEYWORDS = MSOffice.MS_KEYWORDS;
...

we're back to the intended "give me the metadata that best fits the idea of Keywords, as defined
by Tika".  In this case, DublinCore.DC_SUBJECT is probably a much better standard to alias
keywords from than MSOffice, but I'm just sticking to the current mappings for this example.
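
A minimal, self-contained sketch of the aliasing idea follows. Everything in it is a simplified,
hypothetical stand-in (Property, DublinCore, MSOffice, TikaAliases, SimpleMetadata), not the
actual Tika classes; it only illustrates that an alias is the same Property object under a
Tika-level name, so values stored under the standard's property can be read through the alias.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT the real Tika classes.
class AliasDemo {

    // A named metadata property.
    static final class Property {
        private final String name;
        Property(String name) { this.name = name; }
        String getName() { return name; }
    }

    // Properties as defined by the individual standards.
    static final class DublinCore {
        static final Property DC_TITLE = new Property("dc:title");
        static final Property DC_SUBJECT = new Property("dc:subject");
    }

    static final class MSOffice {
        static final Property MS_KEYWORDS = new Property("Keywords");
    }

    // Tika-level aliases: each one simply points at a standard's property.
    // KEYWORDS follows the current mapping used in the example above.
    static final class TikaAliases {
        static final Property TITLE = DublinCore.DC_TITLE;
        static final Property KEYWORDS = MSOffice.MS_KEYWORDS;
    }

    // A tiny metadata store keyed by property name.
    static final class SimpleMetadata {
        private final Map<String, String> values = new HashMap<>();
        void set(Property property, String value) { values.put(property.getName(), value); }
        String get(Property property) { return values.get(property.getName()); }
    }

    public static void main(String[] args) {
        SimpleMetadata metadata = new SimpleMetadata();

        // A parser that knows the format writes using the standard's property.
        metadata.set(DublinCore.DC_TITLE, "Annual Report");
        metadata.set(MSOffice.MS_KEYWORDS, "finance, 2012");

        // A client that does not care about the standard reads via the alias.
        System.out.println(metadata.get(TikaAliases.TITLE));    // prints "Annual Report"
        System.out.println(metadata.get(TikaAliases.KEYWORDS)); // prints "finance, 2012"
    }
}
```

In this sketch, re-pointing KEYWORDS at DublinCore.DC_SUBJECT would be a one-line change in
TikaAliases, and client code that reads TikaAliases.KEYWORDS would not need to change.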

Ray


On Apr 24, 2012, at 7:43 AM, Nick Burch wrote:

> On Fri, 13 Apr 2012, Joerg Ehrlich wrote:
>> I think it would be more clear if parsers/clients would use the namespace or standard
properties explicitly instead of the metadata one. But your idea of having a set of "standard"
properties available in the Metadata class would be a good help for clients who don't care
which "title" or "author" they read. They could just say "Metadata.title" instead of "DublinCore.title".
> 
> One thing to bear in mind is that we've tried to hide the differences in format's metadata
from end users of Tika. You shouldn't need to know if a format calls it "description" or "subject"
or "title" or "dc:title" or "WhatItsAllAbout". Someone who understands the format works out
how to map the file format's metadata onto a common set. End users can then say "give me the
metadata that best fits the idea of Title, as defined by Dublin Core" and they get something
back. The intricacies of the file formats are hidden from them, they get clean and consistent
metadata back.
> 
> I certainly see there are cases when someone may want the full set of metadata back from
a file, in quite a low level way, but we should make sure we don't lose the ability of users
to say "give me the title of that document, no matter what the format stores it as" that we
currently have.
> 
> Nick

