nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-62) Add html META tag information into metaData in index-more plugin
Date Tue, 07 Jun 2005 09:51:42 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-62?page=comments#action_12312857 ] 

Andrzej Bialecki  commented on NUTCH-62:
----------------------------------------

The latest SVN version already contains similar code (see parse-html/..../HTMLMetaProcessor.java).
The only thing that is missing is to put the content meta tags into ParseData.metadata.

As you know, we actually have two places to put metadata into: one is Protocol.metadata, where
all protocol-related metadata should be stored, and the other is ParseData.metadata, where
parse-related metadata should be stored, which is the case here.

However... this may overwrite other properties coming from protocol handlers,
or discovered by other plugins or other parts of the code. The "lang" tag is one such example;
"content-encoding" and "charset" are others. The language identifier plugin works
around this by using an "X-meta-lang" property name. (BTW: it could be cleaned up to avoid
traversing the node tree once again, and instead make use of the already discovered meta tags,
which are now passed as an argument to HtmlParseFilters.)

I suggest reworking this to use a consistent schema in both cases (i.e. Content.metadata and
ParseData.metadata): let's store them under "X-nutch-<name>" keys (where <name> is
the key retrieved from HtmlMetaTags.getGeneralTags()), or under "X-nutch-http-equiv-<name>"
keys (where <name> is the key retrieved from HtmlMetaTags.getHttpEquivTags()),
and so on. So, this would be e.g. "X-nutch-robots", "X-nutch-base", "X-nutch-http-equiv-pragma",
"X-nutch-http-equiv-refresh".

This way we can store all <meta> information, without any danger of overwriting the
original values.
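
To make the naming scheme concrete, here is a minimal sketch of how the prefixed keys might be built. It is not the actual Nutch code: the class MetaTagPrefixSketch, the method addPrefixed, and the use of plain java.util.Map in place of the real metadata and HtmlMetaTags types are all assumptions made for illustration, as is the choice to lowercase tag names.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Nutch.
public class MetaTagPrefixSketch {
    static final String GENERAL_PREFIX = "X-nutch-";
    static final String HTTP_EQUIV_PREFIX = "X-nutch-http-equiv-";

    // Copies <meta name=...> and <meta http-equiv=...> values into the
    // metadata map under prefixed keys, so existing keys are never replaced.
    // Lowercasing the tag name is an assumption, not part of the proposal.
    static void addPrefixed(Map<String, String> generalTags,
                            Map<String, String> httpEquivTags,
                            Map<String, String> metadata) {
        for (Map.Entry<String, String> e : generalTags.entrySet()) {
            String key = GENERAL_PREFIX + e.getKey().toLowerCase(Locale.ROOT);
            metadata.put(key, e.getValue());
        }
        for (Map.Entry<String, String> e : httpEquivTags.entrySet()) {
            String key = HTTP_EQUIV_PREFIX + e.getKey().toLowerCase(Locale.ROOT);
            metadata.put(key, e.getValue());
        }
    }

    public static void main(String[] args) {
        // A value that some other component has already stored.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("lang", "en");

        Map<String, String> general = new LinkedHashMap<>();
        general.put("lang", "fr");
        general.put("robots", "noindex");

        Map<String, String> httpEquiv = new LinkedHashMap<>();
        httpEquiv.put("refresh", "5");

        addPrefixed(general, httpEquiv, metadata);

        // The original "lang" entry is untouched; the meta tag value
        // lives under "X-nutch-lang" instead.
        System.out.println(metadata);
    }
}
```

With this layout, a consumer that wants the value declared in the page looks up "X-nutch-lang", while the "lang" entry set elsewhere stays as it was.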

> Add html META tag information into metaData in index-more plugin
> ----------------------------------------------------------------
>
>          Key: NUTCH-62
>          URL: http://issues.apache.org/jira/browse/NUTCH-62
>      Project: Nutch
>         Type: Improvement
>   Components: indexer
>     Reporter: Jack Tang
>     Priority: Trivial
>  Attachments: index-more.patch.zip
>
> Currently (version dev-0.7), only some metadata from the HTTP response, such as type, date, and
> content-length, is available in the index-more plugin. We cannot index/store the metadata in
> the HTML header (<META> tags, to be exact).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira