nutch-dev mailing list archives

From "Julien Nioche (Closed) (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (NUTCH-809) Parse-metatags plugin
Date Wed, 04 Apr 2012 14:51:24 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche closed NUTCH-809.
-------------------------------

    
> Parse-metatags plugin
> ---------------------
>
>                 Key: NUTCH-809
>                 URL: https://issues.apache.org/jira/browse/NUTCH-809
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.4, nutchgora
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 1.5
>
>         Attachments: NUTCH-809-trunk.patch, NUTCH-809.patch, NUTCH-809_metatags_1.3.patch, metatags-plugin+tutorial.zip
>
>
> h2. Parse-metatags plugin
> The parse-metatags plugin consists of an HTMLParserFilter which takes as a parameter a list of metatag names, with '*' as the default value. The names are separated by ';'.
> To extract the values of the description and keywords metatags, specify the following in nutch-site.xml:
> {code:xml}
> <property>
>   <name>metatags.names</name>
>   <value>description;keywords</value>
> </property>
> {code}
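As an illustration of the `metatags.names` format described above, here is a minimal Java sketch of how a ';'-separated value with a '*' default could be interpreted. It is not the plugin's actual code: the class `MetatagNames` and its methods are hypothetical, and the case-insensitive matching, the whitespace trimming, and the fallback to '*' for a missing value are assumptions.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Nutch): interprets a metatags.names value. */
final class MetatagNames {
    private final Set<String> names;  // lower-cased metatag names listed explicitly
    private final boolean matchAll;   // true when the value contains '*'

    private MetatagNames(Set<String> names, boolean matchAll) {
        this.names = names;
        this.matchAll = matchAll;
    }

    /** Splits the value on ';'. A null or blank value is treated as '*' (assumption). */
    static MetatagNames parse(String value) {
        String v = (value == null || value.trim().isEmpty()) ? "*" : value;
        Set<String> names = new LinkedHashSet<>();
        boolean all = false;
        for (String token : v.split(";")) {
            // Assumption: names are trimmed and compared case-insensitively.
            String name = token.trim().toLowerCase(Locale.ROOT);
            if (name.isEmpty()) {
                continue; // skip empty entries such as "a;;b"
            }
            if (name.equals("*")) {
                all = true;
            } else {
                names.add(name);
            }
        }
        return new MetatagNames(names, all);
    }

    /** Returns true if a metatag with this name should be extracted. */
    boolean accepts(String metatagName) {
        if (metatagName == null) {
            return false;
        }
        return matchAll || names.contains(metatagName.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        MetatagNames configured = MetatagNames.parse("description;keywords");
        System.out.println(configured.accepts("keywords"));
        System.out.println(configured.accepts("author"));
    }
}
```

With the value `description;keywords` only those two metatags are accepted, while the default `*` accepts every metatag name.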
> The MetatagIndexer uses the output of the parsing step above to create two fields, 'keywords' and 'description'. Note that 'keywords' is multivalued.
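To illustrate what a multivalued 'keywords' field next to a single-valued 'description' field might look like, here is a hedged Java sketch. It does not use the Nutch indexing API: the class `MetatagFieldsSketch` and the plain `Map` standing in for an index document are hypothetical. The message does not say how keywords are split, so splitting the metatag content on commas is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not the MetatagIndexer): maps parsed metatags to index fields. */
final class MetatagFieldsSketch {

    /**
     * Builds field name -> values from metatag name -> content.
     * 'description' yields at most one value; 'keywords' yields one value per keyword.
     */
    static Map<String, List<String>> toFields(Map<String, String> metatags) {
        Map<String, List<String>> fields = new LinkedHashMap<>();

        String description = metatags.get("description");
        if (description != null && !description.trim().isEmpty()) {
            fields.computeIfAbsent("description", k -> new ArrayList<>())
                  .add(description.trim());
        }

        String keywords = metatags.get("keywords");
        if (keywords != null) {
            // Assumption: the keywords metatag content is comma-separated.
            for (String keyword : keywords.split(",")) {
                String trimmed = keyword.trim();
                if (!trimmed.isEmpty()) {
                    fields.computeIfAbsent("keywords", k -> new ArrayList<>())
                          .add(trimmed);
                }
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> metatags = new LinkedHashMap<>();
        metatags.put("description", "An open source web crawler");
        metatags.put("keywords", "nutch, crawler, search");
        System.out.println(toFields(metatags));
    }
}
```

Each keyword ends up as a separate value of the same field, which is what "multivalued" means here; a field is only created when there is at least one non-empty value.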
> The query-basic plugin is used to include these fields in the search, e.g. in nutch-site.xml:
> {code:xml}
> <property>
>   <name>query.basic.description.boost</name>
>   <value>2.0</value>
> </property>
> <property>
>   <name>query.basic.keywords.boost</name>
>   <value>2.0</value>
> </property>
> {code}
> This code was developed by DigitalPebble Ltd and offered to the community by ANT.com.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
