nutch-dev mailing list archives

From Doug Cook <nab...@candiru.com>
Subject Missing pages & anchor text
Date Mon, 28 Aug 2006 18:33:15 GMT

Hi, folks,

I have just started digging into relevance issues with Nutch, and I'm
running into some mysteries. Before I dig too deep, I wanted to check to see
if these were known issues (a quick search of the email archives and of JIRA
didn't turn up anything). I'm running 0.8 with a handful of patches.

I'm frequently finding root pages of sites missing from my index, despite
the fact that they have been fetched. In my admittedly short investigation I
have found two classes of cases:

1. Root URL is not a redirect, but there is a root-level index.html page.
The index.html page is in the index, but the root page is not.
Unfortunately, most of the anchor text points to the root page, not the
/index.html page, and the anchor text has gone "missing" along with its
associated page, so relevance is poor.

2. Root URL is a redirect to another page. Again, this other page is in the
index, but the root page, along with its anchor text, has gone
"missing."

I have a deduped index. Both of these cases could result from dedup throwing
out the wrong URL, i.e. the one with more anchor text, although one might
expect dedup to merge the two anchor texts (at least in the case of pages
which commonly normalize to the same URL, e.g. / and /index.html).
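
To make the "merge the two anchor texts" idea concrete, here is a minimal sketch in plain Java. It is not Nutch's dedup code: the `Page` class, `canonicalKey`, and `dedupMergingAnchors` are hypothetical names invented for illustration, and the rule that a root-level /index.html is the same page as / is an assumption that will not hold for every site.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AnchorMergeSketch {
    // Hypothetical record for one indexed page; not a Nutch class.
    static final class Page {
        final String url;
        final List<String> anchors;
        Page(String url, List<String> anchors) {
            this.url = url;
            this.anchors = anchors;
        }
    }

    // Assumption: an empty path, "/" and a root-level "/index.html" are the same page.
    static String canonicalKey(String url) {
        URI u = URI.create(url);
        String path = u.getRawPath();
        if (path == null || path.isEmpty() || path.equals("/index.html")) {
            path = "/";
        }
        String query = u.getRawQuery();
        return u.getScheme() + "://" + u.getAuthority() + path
                + (query == null ? "" : "?" + query);
    }

    // Collapse pages sharing a canonical key, keeping the union of their anchor
    // texts instead of discarding the anchors of the page that loses.
    static Map<String, Page> dedupMergingAnchors(List<Page> pages) {
        Map<String, Page> byKey = new LinkedHashMap<>();
        for (Page p : pages) {
            String key = canonicalKey(p.url);
            Page kept = byKey.get(key);
            if (kept == null) {
                byKey.put(key, new Page(p.url, new ArrayList<>(p.anchors)));
            } else {
                List<String> merged = new ArrayList<>(kept.anchors);
                merged.addAll(p.anchors);
                // Arbitrary choice for this sketch: the shorter URL survives.
                String survivor = p.url.length() < kept.url.length() ? p.url : kept.url;
                byKey.put(key, new Page(survivor, merged));
            }
        }
        return byKey;
    }
}
```

With merging, which URL survives matters much less, because the anchor text ends up on the survivor either way.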

The second case might result from the root URL somehow being normalized to
its redirect target. That would be incorrect in any case, but even then I
would expect the anchor text to also be attached to the redirect target, and
it is not.
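
The behavior expected here, anchor text following a redirect to its target, can be sketched as below. This is an illustration only and makes no claim about how Nutch handles redirects: `resolve`, `reattach`, the redirect map, and the hop limit of 5 are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RedirectAnchorSketch {
    // Follow redirects until a URL with no further redirect is reached.
    // maxHops guards against redirect loops.
    static String resolve(String url, Map<String, String> redirects, int maxHops) {
        String current = url;
        for (int i = 0; i < maxHops && redirects.containsKey(current); i++) {
            current = redirects.get(current);
        }
        return current;
    }

    // Re-key anchor texts so that anchors pointing at a redirecting URL
    // are attached to the page the redirect finally lands on.
    static Map<String, List<String>> reattach(Map<String, List<String>> anchorsByUrl,
                                              Map<String, String> redirects) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : anchorsByUrl.entrySet()) {
            String target = resolve(e.getKey(), redirects, 5);
            out.computeIfAbsent(target, k -> new ArrayList<>()).addAll(e.getValue());
        }
        return out;
    }
}
```

If the index behaved like this, a query matching the root page's anchor text would still find the redirect target, which is not what I am seeing.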

I'm about to rebuild with no deduping and see what I find.

Thanks for your help & comments-

Doug
-- 
View this message in context: http://www.nabble.com/Missing-pages---anchor-text-tf2179049.html#a6025652
Sent from the Nutch - Dev forum at Nabble.com.

