tika-dev mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Metadata Namespaces
Date Mon, 24 Nov 2008 22:47:44 GMT
What do people think of adding some type of namespace to the Metadata
attributes (Dublin Core, CC, etc.)? I think this would allow us to
discern where the metadata came from. For instance, in Solr, see
https://issues.apache.org/jira/browse/SOLR-284?focusedCommentId=12650353#action_12650353

I can see doing this a few different ways:

1. Allow the user to pass in a String that gets prefixed to all  
metadata names, with a constructor like:
  public Metadata(String namespace) {
    metadata = new HashMap<String, String[]>();
    this.namespace = namespace;
  }

and then any time a key is needed, the namespace string is prepended
to it if one was given.

2. Prefix the "core" attributes with "tika."

3. Prefix each sub-attribute appropriately, such as "dc.format" for  
the DublinCore Format attribute.

4. Combine 2 and 3. We could try something a bit more involved and
formally define names like tika.dc.format, so that I could tell that
this attribute is core to Tika, comes from Dublin Core, and is named
Format. Thus, if Solr adds its own parser that for whatever reason
isn't contributed back to Tika (just an example, I don't have anything
in mind), I could create its metadata attributes as solr.foo.bar, or
however I want to do it.
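
To make option 4 a bit more concrete, here is a rough sketch of how a three-part name such as tika.dc.format could be built and taken apart again. The QualifiedName class, its accessors, and the "." separator are all hypothetical, made up for illustration; nothing like this exists in Tika today.

```java
// Hypothetical helper, not part of Tika: models an "owner.schema.name" key
// such as "tika.dc.format" or "solr.foo.bar".
final class QualifiedName {
    private final String owner;   // who defines the attribute, e.g. "tika"
    private final String schema;  // the vocabulary it comes from, e.g. "dc"
    private final String name;    // the attribute itself, e.g. "format"

    QualifiedName(String owner, String schema, String name) {
        this.owner = owner;
        this.schema = schema;
        this.name = name;
    }

    // Splits a key into exactly three parts on the first two dots.
    // Any further dots stay inside the attribute name.
    static QualifiedName parse(String key) {
        String[] parts = key.split("\\.", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                "Expected owner.schema.name but got: " + key);
        }
        return new QualifiedName(parts[0], parts[1], parts[2]);
    }

    String owner() { return owner; }
    String schema() { return schema; }
    String name() { return name; }

    // Rebuilds the flat key that would be stored in the metadata map.
    @Override
    public String toString() {
        return owner + "." + schema + "." + name;
    }
}
```

With something like this, QualifiedName.parse("tika.dc.format") would tell a consumer that the attribute is owned by Tika, comes from the "dc" vocabulary, and is named "format", while a flat key with no dots would be rejected.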

The default, I believe, should still be to have no namespace, i.e. the  
empty string namespace.
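
For option 1 combined with the empty-string default, something along these lines might work. This is only a sketch: NamespacedMetadata is a stand-in class written for this mail, not the real Metadata class, and joining the namespace and the name with a "." is my assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Metadata, used only to illustrate option 1.
class NamespacedMetadata {
    private final Map<String, String[]> metadata = new HashMap<String, String[]>();
    private final String namespace;

    // Default: no namespace, i.e. the empty-string namespace.
    NamespacedMetadata() {
        this("");
    }

    NamespacedMetadata(String namespace) {
        this.namespace = (namespace == null) ? "" : namespace;
    }

    // Prepends the namespace to a raw attribute name, if one was given.
    // The "." separator is an assumption, not something Tika defines.
    String qualify(String name) {
        return namespace.isEmpty() ? name : namespace + "." + name;
    }

    void set(String name, String value) {
        metadata.put(qualify(name), new String[] { value });
    }

    // Returns the first value stored under the qualified name, or null.
    String get(String name) {
        String[] values = metadata.get(qualify(name));
        return (values == null) ? null : values[0];
    }

    // Returns the keys exactly as stored, i.e. with the prefix applied.
    String[] names() {
        return metadata.keySet().toArray(new String[0]);
    }
}
```

Existing callers that use the no-argument constructor would keep seeing plain keys such as "format", while new NamespacedMetadata("solr") would store the same attribute under "solr.format".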

-Grant
