nutch-dev mailing list archives

From Andrzej Bialecki
Subject Re: Not renewing CrawlDatum on Inject
Date Mon, 09 Jul 2007 19:17:59 GMT
Robert Young wrote:
> I have been trying to get to grips with
> org.apache.nutch.crawl.Injector to help with a requirement I have for
> the project I'm working on and I'm a little confused about one place.
> On lines 120 - 121 any existing CrawlDatum is used instead of the
> newly injected one. This doesn't seem to make sense from my point of
> view, I'm guessing it's just a matter of not being able to see the
> issue from the other side. The scenario I am in is as such, when I
> inject a url it is because I want it to be re-indexed, maybe because
> it's changed, I don't care if that url's already in the crawldb I want
> it re-indexed. As far as I can see, if this wasn't the case I wouldn't
> be trying to inject it.
> What am I missing here? Why is the existing CrawlDatum used instead of
> the newly injected one?

That's indeed a place in Nutch that I have planned to change for a long
time ... This behavior is not obvious and, what's worse, it's undocumented.

It would be relatively simple to extend this behavior so that only 
selected parts of data would be updated or replaced when a seed list 
contains the same URL as the one already in CrawlDb.

For now, just add the code that you need in Injector.InjectReducer.
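
A minimal sketch of what such a selective update might look like, written
as standalone Java rather than against the real Nutch classes. DbEntry,
MergePolicy and InjectMergeSketch are hypothetical names used only for
illustration; the actual change would go into Injector.InjectReducer and
operate on CrawlDatum, whose fields and status codes are not modelled here.
KEEP_EXISTING corresponds to the behavior described above, and the other
two policies are assumptions about what a seed list might want instead.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a CrawlDb value; NOT org.apache.nutch.crawl.CrawlDatum.
final class DbEntry {
    final boolean injected; // true if the value comes from the seed list in this run
    final long fetchTime;   // next scheduled fetch, in milliseconds
    final float score;

    DbEntry(boolean injected, long fetchTime, float score) {
        this.injected = injected;
        this.fetchTime = fetchTime;
        this.score = score;
    }
}

// Hypothetical policies for a URL that is both in the seed list and in the db.
enum MergePolicy { KEEP_EXISTING, REPLACE_WITH_INJECTED, RESET_FETCH_TIME }

final class InjectMergeSketch {

    // Mimics the shape of a reduce step: all values for one URL arrive together.
    static DbEntry merge(List<DbEntry> values, MergePolicy policy) {
        DbEntry existing = null;
        DbEntry injected = null;
        for (DbEntry e : values) {
            if (e.injected) {
                injected = e;
            } else {
                existing = e;
            }
        }
        if (existing == null) return injected; // URL is new to the db
        if (injected == null) return existing; // URL is not in the seed list
        switch (policy) {
            case REPLACE_WITH_INJECTED:
                return injected;
            case RESET_FETCH_TIME:
                // Keep the accumulated score, take only the new fetch time.
                return new DbEntry(false, injected.fetchTime, existing.score);
            default:
                return existing; // the behavior the original question asks about
        }
    }

    public static void main(String[] args) {
        DbEntry old = new DbEntry(false, 5000L, 2.0f);
        DbEntry seed = new DbEntry(true, 1000L, 1.0f);
        DbEntry merged = merge(Arrays.asList(old, seed), MergePolicy.RESET_FETCH_TIME);
        System.out.println(merged.fetchTime + " " + merged.score);
    }
}
```

With something like RESET_FETCH_TIME, re-injecting a URL would make it due
for fetching again without discarding the rest of its existing record.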

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
Contact: info at sigram dot com
