nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
Date Tue, 20 Dec 2005 10:21:30 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360901 ] 

Andrzej Bialecki  commented on NUTCH-139:
-----------------------------------------

I have an objection: in fact, I think the patches miss the main point of using prefixed
property names.

In this patch only some of the property names, specifically those corresponding to the Dublin
Core, are prefixed with PREFIX. Why? The original reason for introducing the prefix was this:
as Nutch processes the raw data, it extracts certain metadata, either directly or using heuristics
(like with LANG or content type). In order to distinguish these values from the original raw
values, the metadata processed by Nutch was to be prefixed by "X-nutch-", and all other metadata
that we don't use was to be left alone as it was.

For example, the Content-Type metadata is sometimes wrong. Nutch checks it, e.g. with the mime-type
detection plugin, and it should put the final value of Content-Type in the metadata, but under
the name "X-nutch-Content-Type", in order to avoid overwriting the original value (Chris's
comment in MSWordParser.java reflects this doubt, and that is the reason for prefixing).
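
A minimal sketch of that convention, using a plain java.util.Map rather than the actual Nutch
metadata classes. The class name, the putDetected helper and the sample values are hypothetical,
for illustration only; only the "X-nutch-" prefix comes from the discussion above.

```java
import java.util.HashMap;
import java.util.Map;

public class PrefixedMetadataSketch {
    // Prefix named in this comment; everything else in this class is hypothetical.
    static final String PREFIX = "X-nutch-";

    /** Stores a value determined by Nutch under the prefixed name, leaving the raw value alone. */
    static void putDetected(Map<String, String> metadata, String name, String value) {
        metadata.put(PREFIX + name, value);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        // Raw value as received with the content (possibly wrong).
        metadata.put("Content-Type", "text/plain");
        // Value that a detection step might have determined.
        putDetected(metadata, "Content-Type", "application/msword");

        System.out.println(metadata.get("Content-Type"));         // prints "text/plain"
        System.out.println(metadata.get("X-nutch-Content-Type")); // prints "application/msword"
    }
}
```

Both values remain available, so a consumer can still compare what was originally reported with
what Nutch determined.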

Now, this convention is not followed in the patches. E.g. LANG is missing (it should be PREFIX
+ "lang"). CharEncodingForConversion doesn't have a prefix either. Properties extracted in
plugins (e.g. msword, zip, file) are put under the standard, non-prefixed names, thus
overwriting the original values.
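
As a rough illustration of what consistently prefixed names could look like, here is a sketch of
a constants holder. The class name MetadataNames and the exact string values after the prefix
(other than "lang" and "Content-Type", which are mentioned above) are assumptions, not the
contents of either patch.

```java
public final class MetadataNames {
    // Prefix for all metadata that Nutch itself determines or corrects.
    public static final String PREFIX = "X-nutch-";

    // Values extracted or detected by Nutch: all prefixed, none overwrite raw metadata.
    public static final String CONTENT_TYPE = PREFIX + "Content-Type";
    public static final String LANG = PREFIX + "lang";
    // The suffix used here is an assumption made for this sketch.
    public static final String CHAR_ENCODING_FOR_CONVERSION = PREFIX + "CharEncodingForConversion";

    private MetadataNames() {
        // Constants holder, not meant to be instantiated.
    }

    public static void main(String[] args) {
        System.out.println(CONTENT_TYPE);
        System.out.println(LANG);
        System.out.println(CHAR_ENCODING_FOR_CONVERSION);
    }
}
```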

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 GHz, 1.5 GB RAM, although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything they want,
> so names such as "Content-type", "content-TyPe", and "CONTENT_TYPE" can all have the same
> meaning. Stefan G., I believe, proposed a solution in which all property names are converted
> to lower case, but in essence this fixes only half the problem (the case of identifying
> that "CONTENT_TYPE" and "conTeNT_TyPE" and all the permutations are really the same).
> What about if I named it "Content     Type", or "ContentType"?
> I propose that a way to correct this would be to create a standard set of named Strings
> in the ParseData class that the protocol framework and the parsing framework could use to
> identify common properties such as "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know the names of the standard properties that
> they can obtain from the ParseData, for example by calling
> ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type, or
> ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml") to set it.
> Of course, this wouldn't preclude users from doing what they are currently doing; it would
> just provide a standard method of obtaining some of the more common, critical metadata without
> poring over the code base to figure out what they are named.
> I'll contribute a patch near the end of this week, or the beginning of next week, that
> addresses this issue.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

