nutch-dev mailing list archives

From "Robert Young" <bubble...@gmail.com>
Subject Re: Not renewing CrawlDatum on Inject
Date Tue, 10 Jul 2007 08:19:26 GMT
Would you say it's worth writing it up as a patch and adding it to JIRA?

On 7/9/07, Andrzej Bialecki <ab@getopt.org> wrote:
> Robert Young wrote:
> > I have been trying to get to grips with
> > org.apache.nutch.crawl.Injector to help with a requirement I have for
> > the project I'm working on and I'm a little confused about one place.
> > On lines 120 - 121 any existing CrawlDatum is used instead of the
> > newly injected one. This doesn't seem to make sense from my point of
> > view; I'm guessing it's just a matter of not being able to see the
> > issue from the other side. My scenario is this: when I
> > inject a url it is because I want it to be re-indexed, maybe because
> > it's changed. I don't care if that url's already in the crawldb; I want
> > it re-indexed. As far as I can see, if this wasn't the case I wouldn't
> > be trying to inject it.
> >
> > What am I missing here? Why is the existing CrawlDatum used instead of
> > the newly injected one?
>
> That's indeed a place in Nutch that I have planned to change for a long time
> ... This behavior is not obvious and, what's worse, it's undocumented.
>
> It would be relatively simple to extend this behavior so that only
> selected parts of data would be updated or replaced when a seed list
> contains the same URL as the one already in CrawlDb.
>
> For now, just add the code that you need in Injector.InjectReducer.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
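
A minimal sketch of the kind of change suggested above for
Injector.InjectReducer, written as a standalone model so it runs without
Nutch or Hadoop on the classpath. Everything in it is an assumption made
for illustration: Datum is a simplified stand-in for CrawlDatum, and the
Status and Policy enums and the merge() method are hypothetical names, not
part of the Nutch API. It shows three possible ways to resolve a URL that
appears both in the seed list and in the CrawlDb: keep the existing entry
(the behavior described in this thread), replace it with the injected one,
or update only a selected field.

```java
import java.util.Arrays;
import java.util.List;

public class InjectMergeSketch {
    // Hypothetical, simplified stand-ins for CrawlDatum and its status codes.
    enum Status { INJECTED, DB_UNFETCHED, DB_FETCHED }

    static final class Datum {
        final Status status;
        final long fetchTime;
        final float score;

        Datum(Status status, long fetchTime, float score) {
            this.status = status;
            this.fetchTime = fetchTime;
            this.score = score;
        }
    }

    // Hypothetical policies for a URL present in both the seed list and the db.
    enum Policy { KEEP_EXISTING, REPLACE_WITH_INJECTED, RESET_FETCH_TIME }

    // Models the reduce step: 'values' holds every datum seen for one URL.
    static Datum merge(List<Datum> values, Policy policy) {
        Datum existing = null;
        Datum injected = null;
        for (Datum d : values) {
            if (d.status == Status.INJECTED) {
                injected = d;
            } else {
                existing = d;
            }
        }
        if (injected == null) {
            return existing; // URL was not in the seed list; nothing to merge
        }
        // An injected entry enters the db as "unfetched".
        Datum fresh = new Datum(Status.DB_UNFETCHED, injected.fetchTime, injected.score);
        if (existing == null) {
            return fresh; // URL is new to the db
        }
        switch (policy) {
            case REPLACE_WITH_INJECTED:
                return fresh;
            case RESET_FETCH_TIME:
                // Update only a selected field: keep status and score,
                // take the fetch time from the injected entry.
                return new Datum(existing.status, injected.fetchTime, existing.score);
            default:
                return existing; // existing entry wins
        }
    }

    public static void main(String[] args) {
        Datum old = new Datum(Status.DB_FETCHED, 5000L, 2.0f);
        Datum seed = new Datum(Status.INJECTED, 100L, 1.0f);
        for (Policy p : Policy.values()) {
            Datum r = merge(Arrays.asList(old, seed), p);
            System.out.println(p + ": status=" + r.status
                + " fetchTime=" + r.fetchTime + " score=" + r.score);
        }
    }
}
```

In a real patch the policy would presumably be read from the job
configuration inside the reducer; the property name and default would need
to be agreed on in the JIRA issue.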
