nutch-dev mailing list archives

From Bartosz Gadzimski <bartek...@o2.pl>
Subject Re: planning for nutch-1.0-rc1
Date Mon, 09 Mar 2009 11:21:27 GMT
Hello,

This happens on two Linux boxes, one running CentOS and one running Ubuntu. 
Both run the "old" bin/nutch crawl without problems.
The problem is that no exception shows up on the command line or in Eclipse; 
it is only written to the logs, so it is hard to debug.

One box runs Nutch trunk from 7 March, and the other runs today's rc1.

Any hints? Maybe some logging properties or something similar?

In hadoop.log it looks exactly the same:

2009-03-09 12:12:09,452 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2009-03-09 12:12:09,452 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2009-03-09 12:12:09,560 INFO  field.FieldIndexer - IFD [Thread-11]: setInfoStream deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@6210fb
2009-03-09 12:12:09,560 INFO  field.FieldIndexer - IW 0 [Thread-11]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-agniesia441/mapred/local/index/_-174719952 autoCommit=true mergePolicy=org.apache.lucene.index.LogByteSizeMergePolicy@48edb5 mergeScheduler=org.apache.lucene.index.ConcurrentMergeScheduler@1ee2c2c ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 maxFieldLength=10000 index=
2009-03-09 12:12:09,585 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.NullPointerException
        at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:139)
        at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:1)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
        at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:239)
        at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:1)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
2009-03-09 12:12:10,021 FATAL field.FieldIndexer - FieldIndexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
        at org.apache.nutch.indexer.field.FieldIndexer.index(FieldIndexer.java:267)
        at org.apache.nutch.indexer.field.FieldIndexer.run(FieldIndexer.java:312)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.field.FieldIndexer.main(FieldIndexer.java:275)


Thanks,
Bartosz


Dennis Kubes wrote:
> Sorry about the docs being sparse on this.  I will write more about 
> the process as time permits.  I don't know about the problem below.  
> What platform are you running on: Windows or Linux?
>
> Dennis
>
> Bartosz Gadzimski wrote:
>> Hello,
>>
>> Thanks, Dennis, for updating the wiki; it helped a lot.
>>
>> You gave an example with indexing but didn't say anything about it. 
>> Can you write some more? :)
>>
>> Anyway, I have a problem at the last step (Nutch from 7 March):
>>
>> bin/nutch org.apache.nutch.indexer.field.FieldIndexer
>>
>> It simply stops partway through:
>>
>> 2009-03-07 16:09:04,432 INFO  field.FieldIndexer - FieldIndexer: starting
>> 2009-03-07 16:09:04,436 INFO  field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/basicfields
>> 2009-03-07 16:09:04,498 INFO  field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/anchorfields
>> 2009-03-07 16:09:05,636 INFO  plugin.PluginRepository - Plugins: looking in: /usr/local/nutch/plugins
>> 2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
>> 2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - Registered Plugins:
>> 2009-03-07 16:09:06,437 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
>> 2009-03-07 16:09:06,437 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
>> .... plugins....
>>
>> 2009-03-07 16:09:07,769 INFO  field.FieldIndexer - IFD [Thread-11]: setInfoStream deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@1b4a74b
>> 2009-03-07 16:09:07,769 INFO  field.FieldIndexer - IW 0 [Thread-11]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-root/mapred/local/index/_-884655313 autoCommit=true mergePolicy=org.apache.lucene.index.LogByteSizeMergePolicy@15356d5 mergeScheduler=org.apache.lucene.index.ConcurrentMergeScheduler@69d02b ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 maxFieldLength=10000 index=
>> 2009-03-07 16:09:07,781 WARN  mapred.LocalJobRunner - job_local_0001
>> java.lang.NullPointerException
>>        at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:139)
>>        at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:131)
>>        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
>>        at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:239)
>>        at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:69)
>>        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
>>        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
>> 2009-03-07 16:09:08,197 FATAL field.FieldIndexer - FieldIndexer: java.io.IOException: Job failed!
>>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
>>        at org.apache.nutch.indexer.field.FieldIndexer.index(FieldIndexer.java:267)
>>        at org.apache.nutch.indexer.field.FieldIndexer.run(FieldIndexer.java:312)
>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>        at org.apache.nutch.indexer.field.FieldIndexer.main(FieldIndexer.java:275)
>>
>>
>>
>>
>> crawl/indexes contains only a _temporary folder.
>>
>> I will try to debug this, but I have problems running Nutch in Eclipse.
>>
>> Thanks,
>> Bartosz
>>
>>
>>
>> Dennis Kubes wrote:
>>> I don't know if I would make this primary yet.  I need to check what 
>>> is causing this as it worked fine for me, in fact we currently have 
>>> it in production.  Also we would need to update the shell scripts to 
>>> integrate this more tightly.
>>>
>>> Dennis
>>>
>>> Bartosz Gadzimski wrote:
>>>> Sami Siren wrote:
>>>>> Andrzej Bialecki wrote:
>>>>>> Sami Siren wrote:
>>>>>>> I am planning to build the first rc for nutch 1.0 at Tue 
>>>>>>> 3.3.2009 morning (EET). There are still some issues marked as
>>>>>>> fix for 1.0 in Jira. Neither of the two remaining _bugs_ seems
>>>>>>> too important to me, actually I only count the issues assigned
>>>>>>> to developers as real candidates to be included in 1.0:
>>>>>>>
>>>>>>> NUTCH-578 (kubes)
>>>>>>> NUTCH-477 (ab)
>>>>>>> NUTCH-669 (siren)
>>>>>>
>>>>>> There's one Critical issue reported, related to NekoHTML 
>>>>>> (NUTCH-700). I'm not sure what are the feature differences 
>>>>>> (pertinent to Nutch) between 0.9.4 and 1.9.11 - perhaps 
>>>>>> downgrading is the safest course of action.
>>>>> I will take care of that.
>>>>>>
>>>>>>
>>>>>>> I am also volunteering to push all open issues to 1.1 before
>>>>>>> starting the RC build on Tuesday. Any objections on the proposed
>>>>>>> procedure or timing?
>>>>>>
>>>>>> Sounds good.
>>>>> great!
>>>>>
>>>>> -- 
>>>>> Sami Siren
>>>>>
>>>>>
>>>>>
>>>> What about the new scoring and new indexing? Will it be integrated as the 
>>>> primary scoring algorithm? I have a problem with it in LinkRank:
>>>>
>>>> 2009-03-02 20:43:45,708 INFO  webgraph.LinkRank - Starting link counter job
>>>> 2009-03-02 20:43:47,838 INFO  webgraph.LinkRank - Finished link counter job
>>>> 2009-03-02 20:43:47,839 INFO  webgraph.LinkRank - Reading numlinks temp file
>>>> 2009-03-02 20:43:47,840 INFO  webgraph.LinkRank - Deleting numlinks temp file
>>>> 2009-03-02 20:43:47,842 FATAL webgraph.LinkRank - LinkAnalysis: java.lang.NullPointerException
>>>>        at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:113)
>>>>        at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:582)
>>>>        at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:657)
>>>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>        at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:627)
>>>>
>>>> Another question: what about the indexing framework mentioned here:
>>>> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg11764.html
>>>>
>>>>
>>>> Having all this new scoring and indexing would be a real step forward.
>>>>
>>>> Thanks,
>>>> Bartosz
>>>>
>>>
>>
>

