nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Missing pages & anchor text
Date Thu, 31 Aug 2006 15:19:44 GMT
Doug Cook wrote:
> I'm thinking I should file issues on the following-
>
> 1. The scoring bug. Not sure what to file here, since such things are hard
> to pin down. But defining an "inversion" as
>         score(hostname/(index|default|home).(html|jsp|asp|cfm|etc)) >
> score(hostname)
> on a ~2.5Mdoc database, where I have about 8100 such pairs, 6558 were
> inversions and only 1585 were "okay." Is this likely to be correct behavior
> for OPIC scores? Is this a likely manifestation of a known bug? It doesn't
> seem correct, but then, it's early and I still need more coffee ;-) In any
> case, this causes the "wrong" versions of the pages to be selected most of
> the time during dedup, and I've lost >6500 of the most important, most
> anchor-text-rich pages, in my index -- a significant relevance issue.
>   

The default scoring-opic is admittedly buggy (even if the original 
algorithm is suitable for page scoring, which is not obvious at all). 
However, the inversion problem that you see may stem from the way these 
sites are interlinked - perhaps there really are a lot of inlinks 
pointing to sub-pages instead of roots of the sites?

Anyway, if you feel that shorter urls should get a higher score, then 
you can add a scoring filter to the chain, and in it boost the score 
based on the url length.
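
A minimal sketch of such a boost, written as a standalone function rather 
than against the real scoring filter interface. The class name, the 
STRENGTH constant and the logarithmic formula are all assumptions made for 
illustration; only the idea (a shorter url gets a higher score) comes from 
the paragraph above.

```java
import java.net.URI;

public class UrlLengthBoost {
    // Hypothetical tuning constant: how strongly shorter urls are favored.
    static final float STRENGTH = 1.0f;

    /**
     * Multiplies the score by a factor in (0, 1] that shrinks as the url
     * path gets longer. A root url ("/" or an empty path) keeps its score.
     */
    public static float boost(String url, float score) {
        String path = URI.create(url).getPath();
        // Treat a missing or empty path like "/" so the host root gets factor 1.
        int len = (path == null || path.isEmpty()) ? 1 : path.length();
        float factor = 1.0f / (1.0f + STRENGTH * (float) Math.log(len));
        return score * factor;
    }

    public static void main(String[] args) {
        // The root page keeps its score; the index.html variant is damped.
        System.out.println(boost("http://www.example.com/", 1.0f));
        System.out.println(boost("http://www.example.com/index.html", 1.0f));
    }
}
```

Inside a real scoring filter the same arithmetic would be applied wherever 
the filter adjusts the page score; where exactly to hook it in is left 
open here.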

> 2. When "duplicates" really refer to the same page (e.g. X/ vs.
> X/index.html), entries should be merged. Really, these are just
> after-the-fact normalizations, but they are a class of normalizations which
> can't be done without comparing page fingerprints, since they are not true
> for all web servers.
>   

This should already happen when you run DeleteDuplicates (dedup). Dedup 
selects pages with the same fingerprint, and then retains only the newest 
version if the urls are the same, OR the version with the shorter url if 
the urls are different.
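
The selection rule described above can be sketched roughly as follows. 
This is not the DeleteDuplicates code itself: the Page class, its fields 
and the method names are hypothetical, and the sketch only covers the two 
cases just described (same url: keep the newest; different urls: keep the 
shorter one).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {
    /** Hypothetical stand-in for an indexed page. */
    public static final class Page {
        public final String url;
        public final String fingerprint;
        public final long fetchTime;

        public Page(String url, String fingerprint, long fetchTime) {
            this.url = url;
            this.fingerprint = fingerprint;
            this.fetchTime = fetchTime;
        }
    }

    /** Picks which of two pages with the same fingerprint to keep. */
    static Page prefer(Page a, Page b) {
        if (a.url.equals(b.url)) {
            // Same url: keep the newest version.
            return a.fetchTime >= b.fetchTime ? a : b;
        }
        // Different urls: keep the one with the shorter url.
        return a.url.length() <= b.url.length() ? a : b;
    }

    /** Keeps one page per fingerprint, in order of first appearance. */
    public static List<Page> dedup(List<Page> pages) {
        Map<String, Page> kept = new LinkedHashMap<>();
        for (Page p : pages) {
            kept.merge(p.fingerprint, p, DedupSketch::prefer);
        }
        return new ArrayList<>(kept.values());
    }

    public static void main(String[] args) {
        List<Page> pages = new ArrayList<>();
        pages.add(new Page("http://x.example/index.html", "f1", 200L));
        pages.add(new Page("http://x.example/", "f1", 100L));
        for (Page p : dedup(pages)) {
            System.out.println(p.url);
        }
    }
}
```

Under this rule X/ wins over X/index.html whenever both have the same 
fingerprint, regardless of their scores.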


> 3. Redirects. The index keeps the redirect target, but marks the source as
> unfetched. This is unfortunate behavior, at least for the class of redirects
> where www.x.com redirects to www.x.com/y, which, like the above combination
> of issues, causes the root pages, and thus much of the important anchor
> text, to be dropped from the index. This seems related to, if not the same
> as, NUTCH-273 (https://issues.apache.org/jira/browse/NUTCH-273). I was
> simply planning to add these comments to that issue, unless someone hollers.
>   

Yes, as I indicated in that issue, pages we are redirected from should 
be marked as GONE, and definitely should be marked as fetched. Please 
add your comments if any aspect of what you just said is still missing 
from that issue.

> For all of the cases where we ignore/drop pages, we should think about what
> happens to the inbound anchor text. We should work very very hard to keep
> all the anchor text we have, it's by far the most important page feature for
> relevance.
>   

Agreed. This may not be so easy in some cases, due to the way Nutch 
works at the moment, but we should then discuss how to refactor it to 
support this.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


