nutch-dev mailing list archives

From "J. Gobel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series
Date Tue, 01 Jan 2013 12:08:12 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541626#comment-13541626 ]

J. Gobel commented on NUTCH-1478:
---------------------------------

Hi Kiran,

I have spent some time checking and monitoring the updates to my MySQL metadata field, and
something odd is happening.
Just before the crawl finishes, the metadata field is updated with the correct information:
I can see it being filled with robots (index, follow), description, etc. But as soon
as the crawl finishes, the metadata field is updated to :_csh_�����

I have pasted the last lines of my log below. I am aware that there are still some
issues with MySQL as a backend for Nutch 2.x.


2013-01-01 11:55:53,177 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2013-01-01 11:55:53,903 INFO  parse.ParserJob - Parsing http://nutch.apache.com/
2013-01-01 11:55:54,589 WARN  parse.MetaTagsParser - Found meta tag : robots	index, follow
2013-01-01 11:55:54,589 WARN  parse.MetaTagsParser - Found meta tag : keywords	.com.nl .net.nl com.nl net.nl sld, tld, domain, registry, domain registry, nic, extention, icann
2013-01-01 11:55:54,590 WARN  parse.MetaTagsParser - Found meta tag : description	Registreer nu uw .com.nl of .net.nl extentie.
2013-01-01 11:55:54,619 INFO  regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
2013-01-01 11:55:55,240 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2013-01-01 11:55:56,652 INFO  mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2013-01-01 11:55:59,574 INFO  mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-01-01 11:55:59,575 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-01-01 11:55:59,575 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-01-01 11:55:59,575 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2013-01-01 11:56:02,554 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
                
> Parse-metatags and index-metadata plugin for Nutch 2.x series 
> --------------------------------------------------------------
>
>                 Key: NUTCH-1478
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1478
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.1
>            Reporter: kiran
>         Attachments: metadata_parseChecker_sites.png, Nutch1478.patch, Nutch1478.zip
>
>
> I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x series. This will
> take multiple values of the same tag and index them in Solr, as I patched before (https://issues.apache.org/jira/browse/NUTCH-1467).
> The usage is the same as described here (http://wiki.apache.org/nutch/IndexMetatags), but
> one change is that there is no need to give the 'metatag' keyword before metatag names. For example,
> my configuration looks like this (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml).
>
> This is only the first version and does not include the JUnit test. I will post a
> new version soon.
> This will parse the tags and index them in Solr. Make sure the fields listed under
> 'index.parse.md' in nutch-site.xml are also created in schema.xml in Solr.
> Please let me know if you have any suggestions.
> This is supported by DLA (Digital Library and Archives) of Virginia Tech.
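
For illustration only, the pairing of 'index.parse.md' in nutch-site.xml with matching fields
in Solr's schema.xml, as described in the quoted issue, might look roughly like the sketch below.
The tag names (description, keywords, robots) are taken from the log above. The field type and
attributes on the Solr side are assumptions, not values taken from the patch, and any other
settings described on the IndexMetatags wiki page are left out.

```xml
<!-- nutch-site.xml (sketch): metatag names to index, written without the
     'metatag' prefix as the issue description states for this port -->
<property>
  <name>index.parse.md</name>
  <value>description,keywords,robots</value>
</property>

<!-- Solr schema.xml (sketch): one field per name listed above.
     type="string" and the stored/indexed/multiValued settings are assumptions;
     multiValued is chosen because the port indexes multiple values of the same tag. -->
<field name="description" type="string" stored="true" indexed="true" multiValued="true"/>
<field name="keywords" type="string" stored="true" indexed="true" multiValued="true"/>
<field name="robots" type="string" stored="true" indexed="true" multiValued="true"/>
```

The two fragments belong to two different files; they are shown together only to make the
name-for-name correspondence visible.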

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
