nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <>
Subject [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
Date Wed, 25 Jan 2006 11:28:10 GMT

Andrzej Bialecki  commented on NUTCH-139:

Yes, this should work ok ... but it strikes me as unnecessarily complicated. After all, in
most cases we will have single values and no overrides, so this solution complicates the most
common cases...

At this point it's probably easier just to keep the original <key, val[]> in one Map,
and potential overrides <key, val1[]> in another Map, and then provide a container/facade
with appropriate methods to add/get/set whichever value is necessary.


import java.util.HashMap;

public class MetaData {
  private HashMap original = new HashMap();
  private HashMap actual = new HashMap();

  public void add(String key, String val) {
    // same as in ContentProperties now, uses the "original" map
  }

  public void set(String key, String val) {
    // same as in ContentProperties now, uses the "original" map
  }

  public void setFinal(String key, String val) {
    // as above, but uses the "actual" map
  }

  // return the final value, if it's missing then return the original value
  public Object getFinal(String key) {
    Object res = actual.get(key);
    if (res == null) res = original.get(key);
    return res;
  }
}

This seems to satisfy all the requirements, and with minimal overhead. If this is ok with
you, please prepare a patch, and we should commit it - there are many other changes waiting
in the queue that depend on this patch being applied ...

(BTW, I think it's conceptually the same as using the "X-nutch" prefix to avoid name clashes, but
from the point of view of correct OO programming it looks more "kosher" now... ;-) )
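
For illustration only, the two-map facade described above might be fleshed out roughly as follows. This is a sketch under assumptions, not the patch that was committed: the class name MetaDataSketch, the getOriginal method, the use of generics, and the choice to store values as List<String> are all hypothetical, and the add/set bodies only approximate what ContentProperties does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two-map facade; not the actual Nutch class.
final class MetaDataSketch {
  // Values as supplied by the protocol and parsing layers.
  private final Map<String, List<String>> original = new HashMap<>();
  // Overrides that take precedence over the original values.
  private final Map<String, List<String>> actual = new HashMap<>();

  // Appends a value to the original values stored under key.
  public void add(String key, String val) {
    original.computeIfAbsent(key, k -> new ArrayList<>()).add(val);
  }

  // Replaces all original values under key with a single value.
  public void set(String key, String val) {
    List<String> vals = new ArrayList<>();
    vals.add(val);
    original.put(key, vals);
  }

  // Stores an override under key; the original values are left untouched.
  public void setFinal(String key, String val) {
    List<String> vals = new ArrayList<>();
    vals.add(val);
    actual.put(key, vals);
  }

  // Returns the original values, ignoring any override (null if absent).
  public List<String> getOriginal(String key) {
    return original.get(key);
  }

  // Returns the override if one exists, otherwise the original values.
  public List<String> getFinal(String key) {
    List<String> res = actual.get(key);
    if (res == null) res = original.get(key);
    return res;
  }
}
```

In the common case (single values, no overrides) only the "original" map is ever touched, which is where the low overhead comes from.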

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>          Key: NUTCH-139
>          URL:
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 GHz, 1.5 GB RAM, although bug
> is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt,
> Currently, people are free to name their string-based properties anything they want,
> so that names such as "Content-type", "content-TyPe", and "CONTENT_TYPE" all have the same
> meaning. I believe Stefan G. proposed a solution in which all property names are converted
> to lower case, but in essence this only fixes half the problem (the case of identifying
> that "conTeNT_TyPE" and all its case permutations are really the same). What
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings
> in the ParseData class that the protocol framework and the parsing framework could use to
> identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users would at least know the names of the standard properties that
> they can obtain from the ParseData, for example by calling ParseData.getMetadata().get(ParseData.CONTENT_TYPE)
> to get the content type, or ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml")
> to set it. Of course, this wouldn't preclude users from doing what they are currently doing; it would
> just provide a standard method of obtaining some of the more common, critical metadata without
> poring over the code base to figure out what they are named.
> I'll contribute a patch near the end of this week, or the beginning of next week, that addresses
> this issue.
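
As a rough illustration of the point made in the issue description, the sketch below shows why lower-casing alone fixes only half of the problem, while shared constants sidestep it. The class name StandardNamesSketch and the lowerCase helper are hypothetical, and a plain HashMap stands in for the real metadata container; only the two constants are taken from the description.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; the real constants were proposed for ParseData.
final class StandardNamesSketch {
  static final String CONTENT_TYPE = "content-type";
  static final String CREATOR = "creator";

  // The lower-casing approach: normalizes case only.
  static String lowerCase(String name) {
    return name.toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    // Stand-in for the metadata container.
    Map<String, String> metadata = new HashMap<>();
    metadata.put(CONTENT_TYPE, "text/xml");

    // Writers and readers agree on the key because both use the constant.
    System.out.println(metadata.get(CONTENT_TYPE));

    // Lower-casing reconciles case variants of the same spelling...
    System.out.println(lowerCase("content-TyPe").equals(CONTENT_TYPE));
    // ...but not variants that differ in separators.
    System.out.println(lowerCase("CONTENT_TYPE").equals(CONTENT_TYPE));
    System.out.println(lowerCase("ContentType").equals(CONTENT_TYPE));
  }
}
```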

This message is automatically generated by JIRA.
