nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] Created: (NUTCH-809) Parse-metatags plugin
Date Fri, 02 Apr 2010 14:16:27 GMT
Parse-metatags plugin
---------------------

                 Key: NUTCH-809
                 URL: https://issues.apache.org/jira/browse/NUTCH-809
             Project: Nutch
          Issue Type: New Feature
          Components: parser
            Reporter: Julien Nioche
            Assignee: Julien Nioche
         Attachments: NUTCH-809.patch

h2. Parse-metatags plugin

*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see [TIKA-379]).*


To use the legacy HTML parser instead, specify the following in parse-plugins.xml:

{code:xml}
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
{code}

The parse-metatags plugin consists of an HTMLParserFilter that takes a list of metatag names
as its parameter. The names are separated by ';', and the default value is '*'.

To extract the values of the 'description' and 'keywords' metatags, specify the following
in nutch-site.xml:

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}
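
As a rough illustration of how a ';'-separated metatags.names value could be interpreted, here is a minimal Java sketch. It is not taken from the attached patch: the class MetatagNames and its methods parse and accepts are hypothetical, and treating '*' as "accept every metatag" and trimming whitespace around names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of NUTCH-809.patch.
public class MetatagNames {

    // Splits a metatags.names value on ';'. A missing value falls back to
    // the documented default '*'. Trimming whitespace is an assumption.
    static List<String> parse(String value) {
        if (value == null) {
            value = "*";
        }
        List<String> names = new ArrayList<>();
        for (String part : value.split(";")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    // Assumption: '*' acts as a wildcard that accepts any metatag name.
    static boolean accepts(List<String> names, String metatag) {
        return names.contains("*") || names.contains(metatag);
    }

    public static void main(String[] args) {
        List<String> names = parse("description;keywords");
        System.out.println(names);
        System.out.println(accepts(names, "keywords"));
        System.out.println(accepts(names, "author"));
        System.out.println(accepts(parse(null), "author"));
    }
}
```

With the configuration above, such a helper would keep 'description' and 'keywords' and skip other metatags, while an unset property would fall back to '*'.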

The MetatagIndexer uses the output of the parsing step above to create two fields, 'keywords'
and 'description'. Note that 'keywords' is multivalued.
The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.

This code was developed by DigitalPebble Ltd and offered to the community by ANT.com.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

