nutch-dev mailing list archives

From Bartosz Gadzimski <bartek...@o2.pl>
Subject Re: planning for nutch-1.0-rc1
Date Sun, 08 Mar 2009 17:26:54 GMT
Hello,

Thanks, Dennis, for updating the wiki; it helped a lot.

You gave an example with indexing but didn't say much about it. Could 
you write some more? :)

Anyway, I have a problem at the last step (Nutch from 7 March):

bin/nutch org.apache.nutch.indexer.field.FieldIndexer

It simply stops partway through:

2009-03-07 16:09:04,432 INFO  field.FieldIndexer - FieldIndexer: starting
2009-03-07 16:09:04,436 INFO  field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/basicfields
2009-03-07 16:09:04,498 INFO  field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/anchorfields
2009-03-07 16:09:05,636 INFO  plugin.PluginRepository - Plugins: looking in: /usr/local/nutch/plugins
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - Registered Plugins:
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
.... plugins....

2009-03-07 16:09:07,769 INFO  field.FieldIndexer - IFD [Thread-11]: setInfoStream deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@1b4a74b
2009-03-07 16:09:07,769 INFO  field.FieldIndexer - IW 0 [Thread-11]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-root/mapred/local/index/_-884655313 autoCommit=true mergePolicy=org.apache.lucene.index.LogByteSizeMergePolicy@15356d5 mergeScheduler=org.apache.lucene.index.ConcurrentMergeScheduler@69d02b ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 maxFieldLength=10000 index=
2009-03-07 16:09:07,781 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.NullPointerException
        at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:139)
        at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:131)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
        at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:239)
        at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:69)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
2009-03-07 16:09:08,197 FATAL field.FieldIndexer - FieldIndexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
        at org.apache.nutch.indexer.field.FieldIndexer.index(FieldIndexer.java:267)
        at org.apache.nutch.indexer.field.FieldIndexer.run(FieldIndexer.java:312)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.field.FieldIndexer.main(FieldIndexer.java:275)




The crawl/indexes directory contains only a _temporary folder.

I will try to debug this, but I am having problems running Nutch in Eclipse.

Thanks,
Bartosz



Dennis Kubes wrote:
> I don't know if I would make this primary yet.  I need to check what 
> is causing this as it worked fine for me, in fact we currently have it 
> in production.  Also we would need to update the shell scripts to 
> integrate this more tightly.
>
> Dennis
>
> Bartosz Gadzimski wrote:
>> Sami Siren wrote:
>>> Andrzej Bialecki wrote:
>>>> Sami Siren wrote:
>>>>> I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 
>>>>> morning (EET). There are still some issues marked as fix for 1.0 
>>>>> in Jira. Neither of the two remaining _bugs_ seems too important 
>>>>> to me, actually I only count the issues assigned to developers as 
>>>>> real candidates to be included in 1.0:
>>>>>
>>>>> NUTCH-578 (kubes)
>>>>> NUTCH-477 (ab)
>>>>> NUTCH-669 (siren)
>>>>
>>>> There's one Critical issue reported, related to NekoHTML 
>>>> (NUTCH-700). I'm not sure what are the feature differences 
>>>> (pertinent to Nutch) between 0.9.4 and 1.9.11 - perhaps downgrading 
>>>> is the safest course of action.
>>> I will take care of that.
>>>>
>>>>
>>>>> I am also volunteering to push all open issues to 1.1 before 
>>>>> starting the RC build on Tuesday. Any objections on the proposed 
>>>>> procedure or timing?
>>>>
>>>> Sounds good.
>>> great!
>>>
>>> -- 
>>> Sami Siren
>>>
>>>
>>>
>> What about the new scoring and new indexing? Will they be integrated 
>> as the primary scoring algorithm? I have a problem with LinkRank:
>>
>> 2009-03-02 20:43:45,708 INFO  webgraph.LinkRank - Starting link 
>> counter job
>> 2009-03-02 20:43:47,838 INFO  webgraph.LinkRank - Finished link 
>> counter job
>> 2009-03-02 20:43:47,839 INFO  webgraph.LinkRank - Reading numlinks 
>> temp file
>> 2009-03-02 20:43:47,840 INFO  webgraph.LinkRank - Deleting numlinks 
>> temp file
>> 2009-03-02 20:43:47,842 FATAL webgraph.LinkRank - LinkAnalysis: 
>> java.lang.NullPointerException
>>        at 
>> org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:113)
>>        at 
>> org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:582)
>>        at 
>> org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:657)
>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>        at 
>> org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:627)
>>
>> Another question: what about the indexing framework mentioned here:
>> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg11764.html
>>
>>
>> Having all of the new scoring and indexing would be a real step forward.
>>
>> Thanks,
>> Bartosz
>>
>

