nutch-dev mailing list archives

From Byron Miller <>
Subject Re: Per-page crawling policy
Date Thu, 05 Jan 2006 14:32:49 GMT
Excellent ideas, and that is what I'm hoping to do: use
some of the social bookmarking type ideas to build the
starter sites and link maps from.

I hope to work with Simpy or other bookmarking
projects to build a popularity map of sorts
(human-edited authority) to merge with and weigh against
a computer-generated map (via standard link processing,
anchor results and such).
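
A minimal sketch of that merge, assuming both maps are simply URL-to-score
tables and that a weighted sum is an acceptable way to combine them. The class
name, the `humanWeight` parameter and the "missing means zero" rule are all
hypothetical; nothing here is existing Nutch code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: blends a human-edited authority
// score with a link-derived score for each URL.
class AuthorityMerge {

    // humanWeight is expected in [0, 1]. A URL missing from one map is
    // treated as scoring 0 in that map (an assumption of this sketch).
    static Map<String, Double> merge(Map<String, Double> human,
                                     Map<String, Double> computed,
                                     double humanWeight) {
        Map<String, Double> merged = new HashMap<>();
        // Start with the computer-generated scores, scaled by their share.
        for (Map.Entry<String, Double> e : computed.entrySet()) {
            merged.put(e.getKey(), (1.0 - humanWeight) * e.getValue());
        }
        // Add the human-edited scores on top, scaled by humanWeight.
        for (Map.Entry<String, Double> e : human.entrySet()) {
            merged.merge(e.getKey(), humanWeight * e.getValue(), Double::sum);
        }
        return merged;
    }
}
```

With `humanWeight` at 0.5 both sources count equally; raising it lets the
bookmarking data dominate the starter list.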

My only continuing question is how to manage the
merge and index process when staging/processing
crawl/fetch jobs such as this.  It seems all of our
theories assume a single crawl and publish of that
index rather than a living/breathing corpus.

Unless we map/bucket the segments to have some purpose,
it's difficult to manage how we process them, sort
them or analyze them to define or extract more meaning
from them.

Brain is exploding :)


--- Andrzej Bialecki <> wrote:

> Hi,
> I've been toying with the following idea, which is
> an extension of the 
> existing URLFilter mechanism and the concept of a
> "crawl frontier".
> Let's suppose we have several initial seed urls,
> each with a different 
> subjective quality. We would like to crawl these,
> and expand the 
> "crawling frontier" using outlinks. However, we
> don't want to do it 
> uniformly for every initial url, but rather
> propagate certain "crawling 
> policy" through the expanding trees of linked pages.
> This "crawling 
> policy" could consist of url filters, scoring
> methods, etc - basically 
> anything configurable in Nutch could be included in
> this "policy". 
> Perhaps it could even be the new version of
> non-static NutchConf ;-)
> Then, if a given initial url is a known high-quality
> source, we would 
> like to apply a "favor" policy, where we e.g. add
> pages linked from that 
> url, and in doing so we give them a higher score.
> Recursively, we could 
> apply the same policy for the next generation pages,
> or perhaps only for 
> pages belonging to the same domain. So, in a sense
> the original notion 
> of high-quality would cascade down to other linked
> pages. The important 
> aspect of this to note is that all newly discovered
> pages would be 
> subject to the same policy - unless we have
> compelling reasons to switch 
> the policy (from "favor" to "default" or to
> "distrust"), which at that 
> point would essentially change the shape of the
> expanding frontier.
> If a given initial url is a known spammer, we would
> like to apply a 
> "distrust" policy for adding pages linked from that
> url (e.g. adding or 
> not adding, if adding then lowering their score, or
> applying different 
> score calculation). And recursively we could apply a
> similar policy of 
> "distrust" to any pages discovered this way. We
> could also change the 
> policy on the way, if there are compelling reasons
> to do so. This means 
> that we could follow some high-quality links from
> low-quality pages, 
> without drilling down the sites which are known to
> be of low quality.
> Special care needs to be taken if the same page is
> discovered from pages 
> with different policies, I haven't thought about
> this aspect yet... ;-)
> What would be the benefits of such an approach?
> * the initial page + policy would both control the
> expanding crawling 
> frontier, and it could be differently defined for
> different starting 
> pages. I.e. in a single web database we could keep
> different 
> "collections" or "areas of interest" with
> differently specified 
> policies. But still we could reap the benefits of a
> single web db, 
> namely the link information.
> * URLFilters could be grouped into several policies,
> and it would be 
> easy to switch between them, or edit them.
> * if the crawl process realizes it ended up on a
> spam page, it can 
> switch the page policy to "distrust", or the other
> way around, and stop 
> crawling unwanted content. From now on the pages
> linked from that page 
> will follow the new policy. In other words, if a
> crawling frontier 
> reaches pages with known quality problems, it would
> be easy to change 
> the policy on-the-fly to avoid them or pages linked
> from them, without 
> resorting to modifications of URLFilters.
> Some of the above you can do even now with
> URLFilters, but any change 
> you make now has global consequences. You may also end
> up with awfully 
> complicated rules if you try to cover all cases in
> one rule set.
> How to implement it? Surprisingly, I think that it's
> very simple - just 
> adding a CrawlDatum.policyId field would suffice,
> assuming we have a 
> means to store and retrieve these policies by ID;
> and then instantiate
> it and call the appropriate methods wherever we use
> the URLFilters today and
> do the score calculations.
> Any comments?
> -- 
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _  
> __________________________________
> [__ || __|__/|__||\/|  Information Retrieval,
> Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System
> Integration
>  Contact: info at sigram dot
> com
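
The policyId idea in the quoted proposal can be sketched roughly as follows.
This is only an illustration under stated assumptions: `Datum` stands in for
CrawlDatum with the proposed policyId field, and `CrawlPolicy`, `POLICIES`,
`expand`, the score factors and the "skip URLs with a query string" rule are
all made up for the example. None of it is the actual Nutch API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Nutch code: outlinks inherit the policy of the
// page they were discovered on, and the policy decides filtering and scoring.
class PolicySketch {

    /** A policy bundles a filter rule and a scoring rule (both made up here). */
    static final class CrawlPolicy {
        final float scoreFactor;
        final boolean followQueryUrls;

        CrawlPolicy(float scoreFactor, boolean followQueryUrls) {
            this.scoreFactor = scoreFactor;
            this.followQueryUrls = followQueryUrls;
        }

        // Stand-in for a group of URLFilters: optionally drop URLs with a query string.
        boolean accept(String url) {
            return followQueryUrls || !url.contains("?");
        }

        float score(float parentScore) {
            return parentScore * scoreFactor;
        }
    }

    /** Stand-in for a CrawlDatum carrying the proposed policyId field. */
    static final class Datum {
        final String url;
        final float score;
        final String policyId;

        Datum(String url, float score, String policyId) {
            this.url = url;
            this.score = score;
            this.policyId = policyId;
        }
    }

    /** Policies stored and retrieved by ID. */
    static final Map<String, CrawlPolicy> POLICIES = new HashMap<>();
    static {
        POLICIES.put("favor", new CrawlPolicy(1.5f, true));
        POLICIES.put("default", new CrawlPolicy(1.0f, true));
        POLICIES.put("distrust", new CrawlPolicy(0.5f, false));
    }

    /** Returns the datum for an outlink under the parent's policy, or null if filtered out. */
    static Datum expand(Datum parent, String outlink) {
        CrawlPolicy policy = POLICIES.getOrDefault(parent.policyId, POLICIES.get("default"));
        if (!policy.accept(outlink)) {
            return null;
        }
        return new Datum(outlink, policy.score(parent.score), parent.policyId);
    }

    /** Switching policy on the fly: same page, new policyId for everything below it. */
    static Datum withPolicy(Datum d, String newPolicyId) {
        return new Datum(d.url, d.score, newPolicyId);
    }
}
```

Calling `withPolicy(page, "distrust")` before expanding a page changes how
every page discovered below it is filtered and scored, without touching any
global filter configuration. The open question from the proposal (the same
page reached under different policies) is not handled here.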
