tika-dev mailing list archives

From Nick Burch <nick.bu...@alfresco.com>
Subject RE: Metadata situation and XMP support in Tika
Date Tue, 24 Apr 2012 11:43:35 GMT
On Fri, 13 Apr 2012, Joerg Ehrlich wrote:
> I think it would be more clear if parsers/clients would use the 
> namespace or standard properties explicitly instead of the metadata 
> one. But your idea of having a set of "standard" properties available in 
> the Metadata class would be a good help for clients who don't care which 
> "title" or "author" they read. They could just say "Metadata.title" 
> instead of "DublinCore.title".

One thing to bear in mind is that we've tried to hide the differences between 
formats' metadata from end users of Tika. You shouldn't need to know if a 
format calls it "description" or "subject" or "title" or "dc:title" or 
"WhatItsAllAbout". Someone who understands the format works out how to map 
the file format's metadata onto a common set. End users can then say "give 
me the metadata that best fits the idea of Title, as defined by Dublin 
Core" and they get something back. The intricacies of the file formats 
are hidden from them, so they get clean and consistent metadata back.
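
As a rough illustration of that mapping idea, here is a minimal, 
self-contained Java sketch. It does not use Tika's actual API: the class 
name, the "formatA"/"formatB"/"formatC" labels and the lookup table are all 
hypothetical, made up purely to show one common key being filled from 
whichever native key a format happens to use.

```java
import java.util.HashMap;
import java.util.Map;

public class CommonMetadataSketch {
    // The common key an end user asks for (the Dublin Core title).
    public static final String DC_TITLE = "dc:title";

    // Hypothetical per-format knowledge, written once by someone who
    // understands each format: which native key best fits dc:title.
    private static final Map<String, String> NATIVE_TITLE_KEY = Map.of(
            "formatA", "title",
            "formatB", "subject",
            "formatC", "WhatItsAllAbout");

    // Maps a format's native metadata onto the common set.
    // Unknown formats or missing native keys simply yield no title.
    public static Map<String, String> toCommon(String format,
                                               Map<String, String> nativeMetadata) {
        Map<String, String> common = new HashMap<>();
        String nativeKey = NATIVE_TITLE_KEY.get(format);
        if (nativeKey != null && nativeMetadata.containsKey(nativeKey)) {
            common.put(DC_TITLE, nativeMetadata.get(nativeKey));
        }
        return common;
    }

    public static void main(String[] args) {
        // The end user asks for the same key regardless of the format.
        Map<String, String> a = toCommon("formatA", Map.of("title", "Annual Report"));
        Map<String, String> c = toCommon("formatC", Map.of("WhatItsAllAbout", "Annual Report"));
        System.out.println(a.get(DC_TITLE));
        System.out.println(c.get(DC_TITLE));
    }
}
```

The caller only ever asks for DC_TITLE; which native key supplied the value 
stays inside the mapping table.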

I certainly see there are cases where someone may want the full set of 
metadata back from a file, in quite a low-level way, but we should make 
sure we don't lose the ability we currently have for users to say "give me 
the title of that document, no matter what the format stores it as".

