nutch-dev mailing list archives

From Ken Krugler <kkrug...@krugle.net>
Subject Re: Per-page crawling policy
Date Mon, 16 Jan 2006 15:28:33 GMT
Hi Andrzej,

>I've been toying with the following idea, which is an extension of 
>the existing URLFilter mechanism and the concept of a "crawl 
>frontier".
>
>Let's suppose we have several initial seed urls, each with a 
>different subjective quality. We would like to crawl these, and 
>expand the "crawling frontier" using outlinks. However, we don't 
>want to do it uniformly for every initial url, but rather propagate 
>certain "crawling policy" through the expanding trees of linked 
>pages. This "crawling policy" could consist of url filters, scoring 
>methods, etc - basically anything configurable in Nutch could be 
>included in this "policy". Perhaps it could even be the new version 
>of non-static NutchConf ;-)
>
>Then, if a given initial url is a known high-quality source, we 
>would like to apply a "favor" policy, where we e.g. add pages linked 
>from that url, and in doing so we give them a higher score. 
>Recursively, we could apply the same policy for the next generation 
>pages, or perhaps only for pages belonging to the same domain. So, 
>in a sense the original notion of high-quality would cascade down to 
>other linked pages. The important point to note is that all newly 
>discovered pages would be subject to the same policy, unless we 
>have compelling reasons to switch the policy (from "favor" to 
>"default" or to "distrust"), which would essentially change the 
>shape of the expanding frontier.
>
>If a given initial url is a known spammer, we would like to apply a 
>"distrust" policy for adding pages linked from that url (e.g. adding 
>or not adding; if adding, then lowering their score or applying a 
>different score calculation). And recursively we could apply a 
>similar "distrust" policy to any pages discovered this way. We 
>could also change the policy along the way, if there are compelling 
>reasons to do so. This means that we could follow some high-quality 
>links from low-quality pages, without drilling down into sites which 
>are known to be of low quality.
>
>Special care needs to be taken if the same page is discovered from 
>pages with different policies; I haven't thought about this aspect 
>yet... ;-)

This sounds like the TrustRank algorithm. See 
http://www.vldb.org/conf/2004/RS15P3.PDF. That paper talks about trust 
attenuation via trust dampening (reducing the trust level as you get 
further from a trusted page) and trust splitting (dividing a page's 
trust among its outlinks, much like OPIC does with its "cash").
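
In rough Java, the two attenuation schemes look something like this 
(the class and method names below are made up for illustration; 
they're not from the paper or from Nutch):

// Sketch of the two trust-attenuation schemes described in the TrustRank
// paper. Everything here is illustrative; none of it is Nutch code.
public class TrustAttenuation {

    // Trust dampening: each hop away from a trusted seed multiplies the
    // inherited trust by a constant factor beta < 1.
    public static float dampen(float parentTrust, float beta) {
        return parentTrust * beta;
    }

    // Trust splitting: a page's trust is divided evenly among its outlinks,
    // much like OPIC divides a page's "cash" among the pages it links to.
    public static float split(float parentTrust, int numOutlinks) {
        return numOutlinks > 0 ? parentTrust / numOutlinks : 0.0f;
    }

    public static void main(String[] args) {
        float seedTrust = 1.0f;
        // Two hops from a seed, with beta = 0.85: ~0.72
        System.out.println(dampen(dampen(seedTrust, 0.85f), 0.85f));
        // A trusted page with 10 outlinks passes 0.1 to each child.
        System.out.println(split(seedTrust, 10));
    }
}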

>What would be the benefits of such an approach?
>
>* the initial page + policy would both control the expanding 
>crawling frontier, and it could be differently defined for different 
>starting pages. I.e. in a single web database we could keep 
>different "collections" or "areas of interest" with differently 
>specified policies. But still we could reap the benefits of a single 
>web db, namely the link information.
>
>* URLFilters could be grouped into several policies, and it would be 
>easy to switch between them, or edit them.
>
>* if the crawl process realizes it ended up on a spam page, it can 
>switch the page policy to "distrust", or the other way around, and 
>stop crawling unwanted content. From then on, the pages linked from 
>that page will follow the new policy. In other words, if a crawling 
>frontier reaches pages with known quality problems, it would be easy 
>to change the policy on-the-fly to avoid them or pages linked from 
>them, without resorting to modifications of URLFilters.
>
>Some of the above you can do even now with URLFilters, but any 
>change you make now has global consequences. You may also end up 
>with awfully complicated rules if you try to cover all cases in one 
>rule set.

The approach we took (with Nutch 0.7) is to use the Page nextScore 
field as a 'crawl priority' field. We apply a scoring function to 
each page that takes into account the contents of the page, the 
page's domain (we have a hand-selected set of known "good" domains 
and sub-domains), and the page's OPIC score. The resulting score gets 
divided up among the valid outlinks (after passing these through the 
URL filter), and summed into the appropriate Page.nextScore entries 
in the web db.

Then at crawl time we sort by nextScore, and pick a percentage of the 
total unfetched links.

This gives us a pretty good depth-first crawl, where "depth" is 
defined by the page content scoring function and the set of trusted 
domains.
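
In simplified form, the scoring/splitting step looks roughly like the 
sketch below. The Page fields, the weighting formula, and the method 
names are stand-ins for illustration; this isn't our actual code or 
the Nutch WebDB API.

import java.util.*;

// Simplified stand-in for how we use Page.nextScore as a crawl priority.
public class CrawlPriority {

    static class Page {
        String url;
        float contentScore;    // from the page-content scoring function
        float domainBoost;     // > 1.0 for hand-selected "good" domains
        float opicScore;       // OPIC score from the web db
        List<String> outlinks; // outlinks that survived the URL filters
    }

    // Combine content score, domain boost, and OPIC into one priority.
    static float score(Page p) {
        return (p.contentScore + p.opicScore) * p.domainBoost;
    }

    // Divide the page's score among its filtered outlinks and accumulate
    // it into the nextScore entry for each target URL.
    static void distribute(Page p, Map<String, Float> nextScore) {
        if (p.outlinks == null || p.outlinks.isEmpty()) return;
        float share = score(p) / p.outlinks.size();
        for (String out : p.outlinks) {
            Float current = nextScore.get(out);
            nextScore.put(out, current == null ? share : current + share);
        }
    }

    // At generate time: sort unfetched URLs by accumulated nextScore and
    // take the top fraction as the next fetch list.
    static List<String> topFraction(Map<String, Float> nextScore, double fraction) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(nextScore.entrySet());
        entries.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        int n = (int) Math.ceil(entries.size() * fraction);
        List<String> fetchList = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            fetchList.add(entries.get(i).getKey());
        }
        return fetchList;
    }
}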

The four issues we've run into with this approach are:

1. Even with a pretty broad area of interest, you wind up focusing on 
a subset of all domains, which means that the max-threads-per-host 
limit (for polite crawling) starts killing your efficiency.

To work around this, we've modified Nutch to order the fetch list 
URLs by domain, and constrain the max # of URLs per domain based on 
the total number of URLs to be fetched, and the number of threads 
we're using.
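
Roughly, the fetch-list shaping works like the sketch below. The 
per-host cap formula shown is just a placeholder (the real calculation 
is derived from the fetch list size and the thread count), and this 
isn't the actual Nutch code:

import java.net.URL;
import java.util.*;

// Sketch of shaping the fetch list: group candidate URLs by host and cap
// the number taken per host so no single domain dominates the list.
public class FetchListShaper {

    static List<String> shape(List<String> candidates, int fetchListSize, int numThreads) {
        // Placeholder cap: roughly one thread's worth of work per host.
        int maxPerHost = Math.max(1, fetchListSize / numThreads);

        Map<String, Integer> perHost = new HashMap<>();
        List<String> result = new ArrayList<>();
        for (String u : candidates) {
            if (result.size() >= fetchListSize) break;
            String host;
            try {
                host = new URL(u).getHost();
            } catch (Exception e) {
                continue; // skip malformed URLs
            }
            int count = perHost.getOrDefault(host, 0);
            if (count < maxPerHost) {
                perHost.put(host, count + 1);
                result.add(u);
            }
        }
        return result;
    }
}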

The next step is to turn on HTTP keep-alive, which would be pretty 
easy except for some funky issues we've run into with the HTTP 
connection pool, and the fact that the current protocol plug-in API 
doesn't give us a good channel to pass back the information the fetch 
thread needs to control the keep-alive process.

2. With a vertical crawl, you seem to wind up at "Big Bob's Server" 
sooner/more often than with a breadth-first crawl. Going wide means 
you spend a lot more time on sites with lots of pages, and these are 
typically higher-performance, better-behaved servers. With a vertical 
crawl, we seem to hit a lot more slow/nonconforming servers, which 
then kill our fetch performance.

Typical issues are things like servers sending back an endless stream 
of HTTP response headers, or trickling data back for a big file.

To work around this, we've implemented support for monitoring thread 
performance and interrupting threads that are taking too long.
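
Conceptually it's a watchdog along these lines (simplified; the names 
are invented for the example, and the real version also has to deal 
with threads stuck in blocking socket I/O, not just interruption):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a fetch watchdog: fetcher threads record when they start a
// request, and a monitor thread interrupts any fetch that runs too long
// (e.g. a server trickling data back or streaming endless headers).
public class FetchWatchdog implements Runnable {

    private final Map<Thread, Long> startTimes = new ConcurrentHashMap<>();
    private final long maxFetchMillis;

    public FetchWatchdog(long maxFetchMillis) {
        this.maxFetchMillis = maxFetchMillis;
    }

    // Called by a fetcher thread just before it starts a request.
    public void beginFetch() {
        startTimes.put(Thread.currentThread(), System.currentTimeMillis());
    }

    // Called by a fetcher thread when the request completes (or fails).
    public void endFetch() {
        startTimes.remove(Thread.currentThread());
    }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long now = System.currentTimeMillis();
            for (Map.Entry<Thread, Long> e : startTimes.entrySet()) {
                if (now - e.getValue() > maxFetchMillis) {
                    e.getKey().interrupt(); // abort the stuck fetch
                    startTimes.remove(e.getKey());
                }
            }
            try {
                Thread.sleep(1000);
            } catch (InterruptedException ie) {
                return;
            }
        }
    }
}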

3. With a vertical crawl, you typically want to keep the number of 
fetched URLs (per loop) at a pretty low percentage relative to the 
total number of unfetched URLs in the WebDB. This helps with 
maintaining good crawl focus.

The problem we've run into is that the time spent updating the WebDB 
starts to dominate the total crawl time once you're past about 10M 
pages.

For example, we can fetch 200K pages in an hour on one machine, but 
then it takes us 2.5 hours to update a 15M page WebDB with the 
results. We can do a fetch in parallel with the update from the 
previous crawl, but that doesn't buy us much. The typical solution is 
to increase the number of URLs fetched, to balance out the fetch vs. 
update time, but this winds up wasting bandwidth when most of the 
extra URLs you fetch aren't all that interesting.
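
For a rough sense of the scale: at 200K pages/hour, balancing out a 
2.5 hour update would mean fetching around 2.5 x 200K = 500K pages per 
cycle, which is a much bigger slice of the unfetched URLs than we want 
to take for a focused crawl.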

We're hoping that switching to Nutch 0.8 will help solve this issue. 
We're in the middle of integrating our mods from 0.7; we don't know 
yet how hard this will be.

4. We'd like to save multiple scores for a page, for the case where 
we're applying multiple scoring functions to a page. But modifying 
the WebDB to support a set of scores per page, while not hurting 
performance, seems tricky.

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-470-9200
