nutch-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
Date Tue, 20 Dec 2005 15:13:30 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360929 ] 

Chris A. Mattmann commented on NUTCH-139:
-----------------------------------------

Hi Andrzej,

> I have an objection, in fact I think the patches miss the main point of using of prefixed
> property names.

D'oh!

> In this patch only some of the property names, specifically those corresponding to the
> Dublin Core, are prefixed with PREFIX. Why?

Well, the reasoning went like this. I wanted the metadata property names to be reusable
across the protocol-level code, the parser code, and pretty much anywhere else that uses
what I would call "common" metadata properties in Nutch. At the protocol level especially,
there were bits of code like readHeaders("Content-type") or
String someValue = getHeader("Content-length"), where the code was reading properties that
had already been written to an object by something Nutch has no control over. In these
cases, in order to make all the calls synonymous, e.g., so that a call to
readHeaders("Content-type") gets replaced by readHeaders(CONTENT_TYPE), I couldn't use the
"_X_nutch" prefix on the names, because I didn't write the values into those objects
originally.

On the other hand, wherever the metadata properties being added were under our control, at
the protocol level, the parsing level, etc. (in particular, all of the DC properties), we
controlled both how they were written into the properties object being passed around and
where they were read back, so we could use the X_nutch prefix.

So, in my mind I saw two distinct types of standard metadata properties: those for which we
control both the input and the output data flow, and those for which we really only control
the output data flow.
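
To make the distinction concrete, here is a rough sketch of how the two groups of constants
could sit side by side. It is not the contents of the patch: the class name
MetadataNamesSketch, the CREATOR constant, and the value chosen for NUTCH_PREFIX are
assumptions made for illustration only.

```java
// Hypothetical sketch, not the actual MetadataNames.java from the patch.
final class MetadataNamesSketch {

    // Assumed prefix value, chosen only for illustration.
    static final String NUTCH_PREFIX = "X-nutch-";

    // Group 1: names Nutch both writes and reads, so they can carry the prefix.
    static final String LANGUAGE = NUTCH_PREFIX + "language";
    static final String CREATOR = NUTCH_PREFIX + "creator";

    // Group 2: names written by something outside Nutch (e.g. protocol headers).
    // They must match what is already stored, so they carry no prefix.
    static final String CONTENT_TYPE = "Content-type";
    static final String CONTENT_LENGTH = "Content-length";

    private MetadataNamesSketch() {
        // constants holder, not meant to be instantiated
    }

    public static void main(String[] args) {
        System.out.println(LANGUAGE);
        System.out.println(CONTENT_TYPE);
    }
}
```

Call sites then refer to MetadataNamesSketch.CONTENT_TYPE instead of repeating the string
literal, whichever group the name belongs to.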

> The original reason for introducing the prefix was this: as Nutch processes the raw data,
> it extracts certain metadata, either directly or using heuristics (like with LANG or
> content type). In order to distinguish these values from the original raw values, the metadata
> processed by Nutch was to be prefixed by "X-nutch-", and all other metadata that we don't
> use was to be left alone as it was.

This was followed to a T, except for the case above, which you pointed out. For example,
what would have happened if I had set CONTENT_TYPE="X_nutch_content_type" and then made a
call to getHeaders(CONTENT_TYPE) at the protocol level? Since we never put a property under
that name into the headers properties object, the lookup would find nothing, and everywhere
we read CONTENT_TYPE the value would be empty.
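
A small sketch of that failure mode, using a plain java.util.Map as a stand-in for the
headers properties object. The class name, the serverHeaders() helper, and the "text/html"
value are assumptions for illustration; only the two key strings come from the discussion
above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a Map stands in for the protocol-level headers object.
final class PrefixedLookupSketch {

    // The name the remote side actually used when the header was written.
    static final String PLAIN_CONTENT_TYPE = "Content-type";

    // What the constant would look like if it carried the Nutch prefix.
    static final String PREFIXED_CONTENT_TYPE = "X_nutch_content_type";

    // Simulates headers filled in outside of Nutch's control.
    static Map<String, String> serverHeaders() {
        Map<String, String> headers = new HashMap<>();
        headers.put("Content-type", "text/html");
        return headers;
    }

    public static void main(String[] args) {
        Map<String, String> headers = serverHeaders();
        // The unprefixed name matches the stored key and finds the value.
        System.out.println(headers.get(PLAIN_CONTENT_TYPE));
        // The prefixed name was never written, so the lookup yields null.
        System.out.println(headers.get(PREFIXED_CONTENT_TYPE));
    }
}
```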

> So, e.g. the Content-Type metadata is sometimes wrong. Nutch checks this with e.g. the
> mime-type detection plugin, and it should put the final value of Content-Type in metadata -
> but under the name of "X-nutch-Content-Type", in order to avoid overwriting the original
> value (Chris's comment in MSWordParser.java reflects this doubt - that's the reason for
> prefixing).

Yup, exactly. Good job catching that comment!
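
In other words, something along these lines, where the detected value is stored next to the
original instead of replacing it. This is only a sketch: the class name, the annotate()
helper, and the use of a plain Map are assumptions for illustration, not code from the
patch or from the mime-type plugin.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of keeping the raw value and the Nutch-detected value apart.
final class PreserveOriginalSketch {

    static final String ORIGINAL_CONTENT_TYPE = "Content-Type";
    static final String DETECTED_CONTENT_TYPE = "X-nutch-Content-Type";

    // Returns a copy of the raw metadata with the detected type added under the
    // prefixed name; the entry under the original name is left untouched.
    static Map<String, String> annotate(Map<String, String> raw, String detectedType) {
        Map<String, String> result = new LinkedHashMap<>(raw);
        result.put(DETECTED_CONTENT_TYPE, detectedType);
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put(ORIGINAL_CONTENT_TYPE, "text/plain"); // what the server claimed
        Map<String, String> annotated = annotate(raw, "application/msword");
        System.out.println(annotated.get(ORIGINAL_CONTENT_TYPE));
        System.out.println(annotated.get(DETECTED_CONTENT_TYPE));
    }
}
```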

> Now, this convention is not followed in the patches. E.g. LANG is missing (should be
> PREFIX + "lang").

Not sure I follow this one. In the patch, there's a line:

 public static final String LANGUAGE = NUTCH_PREFIX + "language";

?



> CharEncodingForConversion doesn't have a prefix either. Properties extracted in plugins
> (e.g. msword, zip, file, etc) are put under the standard, non-prefixed names, thus
> overwriting the original values.

This isn't really true at all. I didn't overwrite any of the original values. In fact, no
values are really overwritten at all. There are only two cases really:

1. Places where I standardized only how the names are read: you see these at the bottom of
MetadataNames.java. These are properties where we don't control how they get written into
the properties object, or where I at least couldn't figure out how they got placed into the
properties objects at their particular layers. In these cases, I've omitted the NUTCH_PREFIX
so that reading (and any post-writing) of the properties keeps working.

2. Places where I standardized how the names are both read and written. These are at the top
of MetadataNames.java. I could trace the entire data flow in and out of the properties
objects at the respective layers for all of these properties, and that's why they have the
X-nutch prefix. Make sense?





> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 GHz, 1.5 GB RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything they want, so
> that names such as "Content-type", "content-TyPe", and "CONTENT_TYPE" all carry the same
> meaning. Stefan G., I believe, proposed a solution in which all property names would be
> converted to lower case, but this really only fixes half the problem (identifying that
> "CONTENT_TYPE" and "conTeNT_TyPE" and all the other case permutations are really the same).
> What about if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of named Strings
> in the ParseData class that the protocol framework and the parsing framework could use to
> identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know the names of the standard properties that they
> can obtain from the ParseData, for example by making a call to
> ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type, or a call to
> ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml") to set it. Of course, this
> wouldn't preclude users from doing what they are currently doing; it would just provide a
> standard method of obtaining some of the more common, critical metadata without poring
> over the code base to figure out what they are named.
> I'll contribute a patch near the end of this week, or the beginning of next week, that
> addresses this issue.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

