nutch-dev mailing list archives

From kiran chitturi <chitturikira...@gmail.com>
Subject Re: [jira] [Commented] (NUTCH-1511) Metadata in MYSQL updated with 'garbage'
Date Tue, 01 Jan 2013 20:15:08 GMT
Hi Jaap,

It worked for me previously with MySQL. I am using HBase now, and
everything is going quite well there too.

I am going to try working with MySQL to solve this issue, but I need a few
more details.

Did you try crawling the Nutch website, or anything else?

Did you define index.parse.md in nutch-site.xml, and also the fields in
the Solr schema?
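
For reference, here is a minimal sketch of what that definition might look
like in nutch-site.xml. The values are assumptions, chosen to match the
description and keywords meta tags in your log. The metatags.names property,
its separator, and the "metatag." prefix may differ depending on the version
of the NUTCH-1478 patch you applied, so please check them against the patch.

```xml
<!-- nutch-site.xml fragment (sketch; values are assumptions, not taken from your setup) -->
<property>
  <name>metatags.names</name>
  <!-- assumed: meta tags the metatags parser should extract -->
  <value>description;keywords</value>
</property>
<property>
  <name>index.parse.md</name>
  <!-- assumed: parse-metadata keys the indexer should turn into fields -->
  <value>metatag.description,metatag.keywords</value>
</property>
```

Each key listed in index.parse.md would then need a field of the same name
in the Solr schema.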

Did you restart Solr after you created the schema? Which Nutch version are
you using?

Did you check the Solr logs?

Thank you,
Kiran.

On Tue, Jan 1, 2013 at 1:22 PM, J. Gobel (JIRA) <jira@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/NUTCH-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541896#comment-13541896]
>
> J. Gobel commented on NUTCH-1511:
> ---------------------------------
>
> Hi Kiran,
>
> I never got it to work in Solr4. No matter what I tried, the metadata
> fields never show up in Solr4. Do you index using HBase or MySQL? If
> time allows, please try it with MySQL.
>
> Just create the table below in MySQL. Alternatively, for a more thorough
> explanation, check the guide at http://nlp.solutions.asia/?p=180
>
> CREATE TABLE `webpage` (
> `id` varchar(767) NOT NULL,
> `headers` blob,
> `text` mediumtext DEFAULT NULL,
> `status` int(11) DEFAULT NULL,
> `markers` blob,
> `parseStatus` blob,
> `modifiedTime` bigint(20) DEFAULT NULL,
> `score` float DEFAULT NULL,
> `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
> `baseUrl` varchar(767) DEFAULT NULL,
> `content` longblob,
> `title` varchar(2048) DEFAULT NULL,
> `reprUrl` varchar(767) DEFAULT NULL,
> `fetchInterval` int(11) DEFAULT NULL,
> `prevFetchTime` bigint(20) DEFAULT NULL,
> `inlinks` mediumblob,
> `prevSignature` blob,
> `outlinks` mediumblob,
> `fetchTime` bigint(20) DEFAULT NULL,
> `retriesSinceFetch` int(11) DEFAULT NULL,
> `protocolStatus` blob,
> `signature` blob,
> `metadata` blob,
> PRIMARY KEY (`id`)
> ) ENGINE=InnoDB
> ROW_FORMAT=COMPRESSED
> DEFAULT CHARSET=utf8mb4;
>
> rgds,
>
> Jaap
>
> > Metadata in MYSQL updated with 'garbage'
> > ----------------------------------------
> >
> >                 Key: NUTCH-1511
> >                 URL: https://issues.apache.org/jira/browse/NUTCH-1511
> >             Project: Nutch
> >          Issue Type: Bug
> >          Components: fetcher
> >    Affects Versions: 2.1
> >         Environment: Ubuntu 12.04
> >            Reporter: J. Gobel
> >              Labels: metadata, mysql, nutch
> >
> > After applying the patch for the metadata parser (NUTCH-1478), I notice
> > that the metadata field is populated with the correct information just
> > before the crawl ends. However, when the crawl has completely finished,
> > the metadata field is populated with 'garbage': _csh_ �����
> > last few lines of my logfile:
> > p.s. I use : bin/nutch crawl urls -depth 1 -topN 5 ..
> > 2013-01-01 11:55:53,177 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
> > 2013-01-01 11:55:53,903 INFO parse.ParserJob - Parsing http://nutch.apache.com/
> > 2013-01-01 11:55:54,589 WARN parse.MetaTagsParser - Found meta tag : robots index, follow
> > 2013-01-01 11:55:54,589 WARN parse.MetaTagsParser - Found meta tag : keywords .com.nl .net.nl com.nl net.nl sld, tld, domain, registry, domain registry, nic, extention, icann
> > 2013-01-01 11:55:54,590 WARN parse.MetaTagsParser - Found meta tag : description Registreer nu uw .com.nl of .net.nl extentie.
> > 2013-01-01 11:55:54,619 INFO regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
> > 2013-01-01 11:55:55,240 WARN mapred.FileOutputCommitter - Output path is null in cleanup
> > 2013-01-01 11:55:56,652 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
> > 2013-01-01 11:55:59,574 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
> > 2013-01-01 11:55:59,575 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> > 2013-01-01 11:55:59,575 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> > 2013-01-01 11:55:59,575 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> > 2013-01-01 11:56:02,554 WARN mapred.FileOutputCommitter - Output path is null in cleanup
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>



-- 
Kiran Chitturi
