nutch-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <>
Subject [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
Date Tue, 20 Dec 2005 15:24:33 GMT

Chris A. Mattmann commented on NUTCH-139:


 Okay, I just finished reading the rest of the comments :-) Sorry, just woke up out here in
Los Angeles. Okay, I think I understand what you guys are getting at here. X-nutch should
be the "reliable" metadata that we create, i.e., the names whose input and output dataflow
we control within Nutch right now. The other names, such as "Content-type" and "Content-length",
are written at the protocol layer, and Nutch doesn't control how they are written into the
properties objects, so those names should be left alone then, no? Is that the gist of it? So,
you guys propose we would have something like:

//some protocol layer plugin
String contentType = getHeader("Content-type");

//CONTENT_TYPE = "X-nutch-content-type";
propertiesObject.put(CONTENT_TYPE, contentType);


If this is the case, then I would still point out that there are metadata names like
"Content-type" that would be good to standardize at the protocol level (shameless self
plug of what I already did ;) ) in terms of how they are read. You could group these other
metadata names, since they aren't prefixed with X-nutch, in some other class or something,
but I think it's still important at the protocol level to standardize the code, if nothing
else. So the above example would become:

//some protocol layer plugin
//you're such a non-standard property, aren't you
String contentType = getHeader(PROTO_LYR_READ_ONLY_CONTENT_TYPE);

//CONTENT_TYPE = "X-nutch-content-type";
propertiesObject.put(CONTENT_TYPE, contentType);

So, I think it's a good compromise to not only standardize on what we write/read, but also
what we read only. Of course, I'm open to comments on this. :-)
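
To make the idea a bit more concrete, here is a rough sketch of how the two sets of names could be kept apart. The class names NutchMetadata and ProtocolHeaders, the copyContentType helper, and the use of a plain Map as the properties object are all made up for illustration; none of this is taken from the patch or from the current code base.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataNamesSketch {

    /** Hypothetical holder for the names Nutch itself writes and reads. */
    static final class NutchMetadata {
        static final String CONTENT_TYPE = "X-nutch-content-type";
        private NutchMetadata() {}
    }

    /** Hypothetical holder for names that are only read at the protocol layer. */
    static final class ProtocolHeaders {
        static final String CONTENT_TYPE = "Content-type";
        private ProtocolHeaders() {}
    }

    /**
     * Reads the protocol-level header under its standardized read-only name
     * and stores the value under the Nutch-controlled name.
     */
    static Map<String, String> copyContentType(Map<String, String> headers) {
        Map<String, String> properties = new LinkedHashMap<>();
        String contentType = headers.get(ProtocolHeaders.CONTENT_TYPE);
        if (contentType != null) {
            properties.put(NutchMetadata.CONTENT_TYPE, contentType);
        }
        return properties;
    }

    public static void main(String[] args) {
        // Stand-in for the headers a protocol plugin would receive.
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Content-type", "text/html");

        Map<String, String> properties = copyContentType(headers);
        System.out.println(properties);
    }
}
```

The point is only that both the name we read and the name we write come from a constant, so no plugin has to spell either string by hand.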

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>          Key: NUTCH-139
>          URL:
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 GHz, 1.5 GB RAM, although bug
> is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt,
> Currently, people are free to name their string-based properties anything that they want,
> such as having names of "Content-type", "content-TyPe", "CONTENT_TYPE" all having the same
> meaning. Stefan G., I believe, proposed a solution in which all property names be converted
> to lower case, but in essence this really only fixes half the problem, right (the case of
> identifying that "CONTENT_TYPE" and "conTeNT_TyPE" and all the permutations are really the
> same)? What about if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings
> in the ParseData class that the protocol framework and the parsing framework could use to
> identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the names of the standard properties that
> they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE)
> to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml").
> Of course, this wouldn't preclude users from doing what they are currently doing; it would
> just provide a standard method of obtaining some of the more common, critical metadata without
> poring over the code base to figure out what they are named.
> I'll contribute a patch near the end of this week, or the beginning of next week, that
> addresses this issue.

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators:
For more information on JIRA, see:
