nutch-dev mailing list archives

From "Vangelis Karvounis (JIRA)" <>
Subject [jira] [Updated] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series
Date Mon, 10 Mar 2014 18:41:42 GMT


Vangelis Karvounis updated NUTCH-1478:

    Attachment: NUTCH-1478v5.1.patch

I have made a patch, but I don't know if I have done it correctly. :P
Anyway, my goal here was to handle both property and rel tags. I would be glad if I could be
of any help!

If you want to patch this version, you need to alter the plugin/parse-metatags/
from the latest v5 patch as follows:

Add the following code just before 'return parse' inside the ParseFilter method filter(String url,
WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc):

    // Property tags: store those listed in metatagset (or all of them if it contains "*").
    Properties property = metaTags.getPropertyTags();
    Enumeration<?> properNames = property.propertyNames();
    while (properNames.hasMoreElements()) {
        String name1 = (String) properNames.nextElement();
        String value1 = property.getProperty(name1);
        if (metatagset.contains("*") || metatagset.contains(name1.toLowerCase())) {
            LOG.debug("Found meta tag : " + name1 + "\t" + value1);
            //System.out.println("Found meta tag : " + name1 + "\t" + value1);
            page.putToMetadata(new Utf8(PARSE_META_PREFIX + name1.toLowerCase()),

    // Rel tags: same filtering as for the property tags.
    Properties relProp = metaTags.getRelTags();
    Enumeration<?> relNames = relProp.propertyNames();
    while (relNames.hasMoreElements()) {
        String name2 = (String) relNames.nextElement();
        String value2 = relProp.getProperty(name2);
        if (metatagset.contains("*") || metatagset.contains(name2.toLowerCase())) {
            LOG.debug("Found meta tag : " + name2 + "\t" + value2);
            //System.out.println("Found meta tag : " + name2 + "\t" + value2);
            page.putToMetadata(new Utf8(PARSE_META_PREFIX + name2.toLowerCase()),

    //System.out.println("	"+metaTags.toString());
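
For illustration only: the two loops above follow the same pattern, so the idea can be
sketched as one helper in self-contained Java. This is a sketch under assumptions, not the
code in the patch. The Map stands in for the WebPage metadata, and the prefix value, the
UTF-8 encoding, the use of Locale.ROOT, and the names MetaTagCopySketch and copyTags are
all made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class MetaTagCopySketch {

    // Assumption: placeholder for the plugin's PARSE_META_PREFIX constant.
    static final String PARSE_META_PREFIX = "meta_";

    /**
     * Copies every tag whose lower-cased name is listed in metatagset
     * (or every tag, if the set contains "*") into metadata.
     * The Map is a stand-in for the page metadata; it is not the Nutch API.
     */
    public static void copyTags(Properties tags, Set<String> metatagset,
                                Map<String, ByteBuffer> metadata) {
        Enumeration<?> names = tags.propertyNames();
        while (names.hasMoreElements()) {
            String name = (String) names.nextElement();
            String value = tags.getProperty(name);
            // Assumption: Locale.ROOT keeps lower-casing independent of the JVM locale.
            String lcName = name.toLowerCase(Locale.ROOT);
            if (metatagset.contains("*") || metatagset.contains(lcName)) {
                // Assumption: values are stored as UTF-8 bytes.
                metadata.put(PARSE_META_PREFIX + lcName,
                        ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
            }
        }
    }

    public static void main(String[] args) {
        // Sample tags, invented for the example.
        Properties propertyTags = new Properties();
        propertyTags.setProperty("og:title", "Example page");
        propertyTags.setProperty("og:type", "article");

        Properties relTags = new Properties();
        relTags.setProperty("canonical", "http://example.com/");

        Set<String> metatagset = new HashSet<>();
        metatagset.add("og:title");
        metatagset.add("canonical");

        Map<String, ByteBuffer> metadata = new HashMap<>();
        copyTags(propertyTags, metatagset, metadata);
        copyTags(relTags, metatagset, metadata);

        // Only the two tags named in metatagset are kept.
        System.out.println(metadata.keySet());
    }
}
```

Calling the helper once per tag source (property tags, then rel tags) avoids duplicating
the loop body, which is where the copy-paste slip in the commented-out println came from.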

> Parse-metatags and index-metadata plugin for Nutch 2.x series 
> --------------------------------------------------------------
>                 Key: NUTCH-1478
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.1
>            Reporter: kiran
>             Fix For: 2.3
>         Attachments: NUTCH-1478-parse-v2.patch, NUTCH-1478v3.patch, NUTCH-1478v4.patch,
NUTCH-1478v5.1.patch, NUTCH-1478v5.patch, Nutch1478.patch,, metadata_parseChecker_sites.png
> I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x series.  This will
take multiple values of the same tag and index them in Solr, as I patched before (
> The usage is the same as described here ( but
one change is that there is no need to give the 'metatag' keyword before metatag names. For example,
my configuration looks like this (

> This is only the first version and does not include a JUnit test. I will upload a
new version soon.
> This will parse the tags and index them in Solr. Make sure the fields listed
in '' in nutch-site.xml are also created in schema.xml in Solr.
> Please let me know if you have any suggestions.
> This is supported by DLA (Digital Library and Archives) of Virginia Tech.

This message was sent by Atlassian JIRA
