nutch-dev mailing list archives

From Stefan Groschupf ...@101tec.com>
Subject Re: Missing pages & anchor text
Date Tue, 29 Aug 2006 06:36:29 GMT
Hi Doug,
I'm pretty sure that your problem is related to the deduping of your  
index.
In general, the hash of a page's content is used as the key by the  
dedup tool.
We also ran into the forwarding problem in another case:
https://issues.apache.org/jira/browse/NUTCH-353
So maybe we should think about a general solution to the forwarding  
problem.
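
To make the hash-keyed dedup concrete, here is a minimal sketch in Python. It is not Nutch's actual code: the Doc class, the MD5 choice, the example.com URLs, and the "keep the first document seen" rule are all assumptions made for illustration. It shows how two URLs with identical content collapse to one entry, so the anchor text attached to the dropped URL disappears, and how a hypothetical variant could merge the anchors instead.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Doc:
    # Hypothetical stand-in for an indexed page; not a Nutch class.
    url: str
    content: str
    anchors: list = field(default_factory=list)


def content_key(doc):
    # Assumption: the dedup key is an MD5 digest of the page content.
    return hashlib.md5(doc.content.encode("utf-8")).hexdigest()


def dedup_by_content(docs):
    # Keep the first document seen for each content hash (assumed rule).
    # Anchor text on the discarded duplicates is lost.
    kept = {}
    for doc in docs:
        kept.setdefault(content_key(doc), doc)
    return list(kept.values())


def dedup_merging_anchors(docs):
    # Hypothetical alternative: keep one document per content hash,
    # but carry over the anchor text of every duplicate.
    kept = {}
    for doc in docs:
        key = content_key(doc)
        if key in kept:
            kept[key].anchors.extend(doc.anchors)
        else:
            kept[key] = Doc(doc.url, doc.content, list(doc.anchors))
    return list(kept.values())


docs = [
    Doc("http://example.com/index.html", "<html>home</html>"),
    Doc("http://example.com/", "<html>home</html>",
        ["Example home", "Example Inc."]),
]

plain = dedup_by_content(docs)
merged = dedup_merging_anchors(docs)
print(plain[0].url, plain[0].anchors)
print(merged[0].url, merged[0].anchors)
```

With this input the plain version keeps only /index.html with no anchors, which matches the symptom Doug describes. Which duplicate a real dedup step keeps depends on its own rule, so treat the ordering here as illustrative only.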

Greetings,
Stefan


On 28.08.2006 at 11:33, Doug Cook wrote:

>
> Hi, folks,
>
> I have just started digging into relevance issues with Nutch, and I'm
> running into some mysteries. Before I dig too deep, I wanted to  
> check to see
> if these were known issues (a quick search of the email archives  
> and of JIRA
> didn't turn up anything). I'm running 0.8 with a handful of patches.
>
> I'm frequently finding root pages of sites missing from my index,  
> despite
> the fact that they have been fetched. In my admittedly short  
> investigation I
> have found two classes of cases:
>
> 1. Root URL is not a redirect, but there is a root-level index.html  
> page.
> The index.html page is in the index, but the root page is not.
> Unfortunately, most of the anchor text points to the root page, not  
> the
> /index.html page, and the anchor text has gone "missing" along with  
> its
> associated page, so relevance is poor.
>
> 2. Root URL is a redirect to another page. Again, this other page  
> is in the
> index, but the root page, along with its anchor text, has gone
> "missing."
>
> I have a deduped index. Both of these cases could result from dedup  
> throwing
> out the wrong URL, i.e. the one with more anchor text, although one  
> might
> expect dedup to merge the two anchor texts (at least in the case of  
> pages
> which commonly normalize to the same URL, e.g. / and /index.html).
>
> The second case might result from the root URL somehow being  
> normalized to
> its redirect target, but in that case (incorrect, in any case) I would
> expect the anchor text to also be attached to the redirect target,  
> and it is
> not.
>
> I'm about to rebuild with no deduping and see what I find.
>
> Thanks for your help & comments-
>
> Doug
> -- 
> View this message in context:
> http://www.nabble.com/Missing-pages---anchor-text-tf2179049.html#a6025652
> Sent from the Nutch - Dev forum at Nabble.com.
>
>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec Inc.
Menlo Park, California
http://www.101tec.com
