nutch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Piotr Kosiorowski <pkosiorow...@gmail.com>
Subject NUTCH-7 bug
Date Fri, 05 Aug 2005 18:40:30 GMT
Hello all,
Some time ago I sent a patch for:
http://issues.apache.org/jira/browse/NUTCH-7 (analyze runs for an
excessive amount of time and creates huge temp files until it runs out
of disk space (if you let the db grow)).
I know PageRank computation is not activly maintained and we will
probably have to revisit it after moving to mapreduce branch but I
suspect nutch-0.7 to be used by some people for some time, so maybe it
is worth fixing (especially as it is already done:)).

So if noone object I will commit it on Monday. Below you can find
motivation for a change.

The problem:

When you have a set of pages with the same MD5(I assume here such pages
are identical) they may contain two kinds of links ->
	a) relative ones
	b) absolute ones

Lets say we have two pages A and B with the same MD5.
One we have an absolute link to page C on them we end up with just one
link entry in our WebDB (MD5(A)=MD5(A) -> C).

Than we have two pages D and E also with the same MD5 but with relative
link.
When the link is relative we have two link entries:
	MD5(D)=MD5(E) -> D/relative
	MD5(D)=MD5(E) -> E/relative
During PageRank calculation we have:

for (Enumeration e = reader.pagesByMD5(); e.hasMoreElements(); curIndex++) {
    ...
    Link outLinks[] = reader.getLinks(curPage.getMD5());
    ...
}

So for each page we retrieve and process all link from pages with the
same MD5.

So back to our example:
During processing of page A we will get one link MD5(A)=MD5(A) -> C for
processing. The same for page B. This is correct behaviour.
But for page D ( and E) we will get two links:
MD5(D)=MD5(E) -> D/relative
MD5(D)=MD5(E) -> E/relative
If we have 3 pages with the same Md5 and relative link we will get 9
links for processing - so the number of links to process is growing
quite quickly (and I had more than 1mln of pages with the same MD5 in my
WebDB).
This behaviour is not correct.

As there is no way to distinguish if link was relative or not during
PageRank computation I think we can handle relative links better but it
would lead to not so perfect treatment of absolute links. My suggestion
is to process only one page from a set of pages having the same MD5 -
in this case for all relative links PageRank would be propagated from 
one page (currently just first one but we can change it to eg. with 
highest PageRank in future). For absolute links it will treat the target 
page as if only it will have one incoming link.

It is not perfect but current process is also not perfect, but it will 
do not have this quadratic increase in number of links to be processed 
for big sets of "same MD5" pages.

In fact I think it will not change much in final score as such 
problematic pages are only a fraction of all pages in WebDB. But it will 
give lower disk usage and better performance.

Regards,
Piotr







Mime
View raw message