nutch-dev mailing list archives

From Doug Cook <nab...@candiru.com>
Subject Re: Missing pages & anchor text
Date Tue, 29 Aug 2006 14:17:59 GMT

Hi Stefan,

Yes, you're right. The index built without deduping does not show the first
class of problem (though of course, it's also filled with duplicates, so it
has other problems). It still shows the problem with missing redirects,
though this could have a different cause (I will investigate that next).

A little digging has turned up more information:

1) Dedup throws away pages whose content matches, and decides which one to
keep based upon score. This leads it to dump the wrong page, because:

http://www.x.com/
    score: 1.2
http://www.x.com/index.html
    score: 1.8

I see two problems.

First, there is clearly a scoring problem (possibly my fault somehow; could
this have resulted from my failing to build the index properly?). The root
page actually has 9 inlinks; the index.html page has none. I can't see
anything that would warrant the index.html getting a higher score, even were
these actually different pages. Seems like this could be related to the
problems you've already discovered. One (perhaps just short term?)
possibility would be to use the inbound linkcount for deciding which page
becomes the "canonical" version of a duplicate set, since this is probably
more stable than the scores.
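
To make the suggestion concrete, here is a minimal sketch in Python (not
Nutch's actual Java code). The Page class, the pick_canonical helper and the
"abc" hash value are hypothetical; the URLs, scores and inlink counts are the
ones quoted above. It only illustrates how the choice of key changes which
page survives:

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    content_hash: str  # hypothetical placeholder for the content fingerprint
    score: float
    inlinks: int


def pick_canonical(duplicates, key):
    """Return the page from a duplicate set that maximizes key(page)."""
    return max(duplicates, key=key)


# Both pages share a content hash, so dedup treats them as one duplicate set.
pages = [
    Page("http://www.x.com/", "abc", 1.2, 9),
    Page("http://www.x.com/index.html", "abc", 1.8, 0),
]

# Choosing by score alone keeps /index.html and drops the root page.
by_score = pick_canonical(pages, key=lambda p: p.score)

# Choosing by inbound link count (score only breaks ties) keeps the root page.
by_inlinks = pick_canonical(pages, key=lambda p: (p.inlinks, p.score))

print(by_score.url)
print(by_inlinks.url)
```

Using the score as a tie-breaker is my own assumption; any stable secondary
key would do.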

Second, these are in fact the same page. Regardless of which page "wins" by
score, dedup should actually merge the two entries since this is a safe
normalization, given that we know the content fingerprints are the same. The
anchor texts and the scores should be combined. We can't necessarily do this
for the general dedup case -- a page shouldn't necessarily benefit just
because there are multiple copies of it -- though even there we may be able
to combine some anchor text. But in this case these are not multiple copies;
they are the same page.
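
As a rough illustration of that merge, here is a small Python sketch (again,
not Nutch code). The Entry class, the normalize rule, the "abc" hash and the
anchor strings are all hypothetical, and summing the scores is only one
possible way to "combine" them:

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    url: str
    content_hash: str  # hypothetical placeholder for the content fingerprint
    score: float
    anchors: list = field(default_factory=list)


def normalize(url):
    """Hypothetical rule: map a trailing /index.html to the directory URL."""
    if url.endswith("/index.html"):
        return url[: -len("index.html")]
    return url


def merge_same_page(a, b):
    """Merge two entries only if they are the same page, else return None.

    "Same page" here means identical content fingerprints AND URLs that
    normalize to the same value; plain duplicates are left alone.
    """
    if a.content_hash != b.content_hash:
        return None
    if normalize(a.url) != normalize(b.url):
        return None
    return Entry(
        url=normalize(a.url),
        content_hash=a.content_hash,
        score=a.score + b.score,      # assumption: combine scores by summing
        anchors=a.anchors + b.anchors,  # keep all anchor text from both
    )


root = Entry("http://www.x.com/", "abc", 1.2, ["X home page"])
index = Entry("http://www.x.com/index.html", "abc", 1.8, ["index"])
merged = merge_same_page(root, index)
print(merged.url, merged.anchors)
```

A copy of the same content on a different host would fail the URL check and
would still go through the normal dedup path.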

In any case, we should work hard not to lose anchor text unless it is
completely justified (e.g. for spam). For relevance purposes, anchor text is
more important than any other page feature, score included. And especially
in our world of small, focused crawls, it is a precious, scarce resource.

Thoughts? Comments?

-Doug


Stefan Groschupf-2 wrote:
> 
> Hi Doug,
> I'm pretty sure that your problem is related to the deduping of your
> index.
> In general, the hash of the content of a page is used as the key for the
> dedup tool.
> We also ran into the forwarding problem in another case:
> https://issues.apache.org/jira/browse/NUTCH-353
> So maybe we should think about a general solution to the forwarding
> problem.
> 
> Greetings,
> Stefan
> 
> 
> On 28.08.2006 at 11:33, Doug Cook wrote:
> 
>>
>> Hi, folks,
>>
>> I have just started digging into relevance issues with Nutch, and I'm
>> running into some mysteries. Before I dig too deep, I wanted to check to
>> see if these were known issues (a quick search of the email archives and
>> of JIRA didn't turn up anything). I'm running 0.8 with a handful of
>> patches.
>>
>> I'm frequently finding root pages of sites missing from my index, despite
>> the fact that they have been fetched. In my admittedly short investigation
>> I have found two classes of cases:
>>
>> 1. Root URL is not a redirect, but there is a root-level index.html page.
>> The index.html page is in the index, but the root page is not.
>> Unfortunately, most of the anchor text points to the root page, not the
>> /index.html page, and the anchor text has gone "missing" along with its
>> associated page, so relevance is poor.
>>
>> 2. Root URL is a redirect to another page. Again, this other page is in
>> the index, but the root page, along with its anchor text, has gone
>> "missing."
>>
>> I have a deduped index. Both of these cases could result from dedup
>> throwing out the wrong URL, i.e. the one with more anchor text, although
>> one might expect dedup to merge the two anchor texts (at least in the case
>> of pages which commonly normalize to the same URL, e.g. / and /index.html).
>>
>> The second case might result from the root URL somehow being normalized to
>> its redirect target, but in that case (incorrect, in any case) I would
>> expect the anchor text to also be attached to the redirect target, and it
>> is not.
>>
>> I'm about to rebuild with no deduping and see what I find.
>>
>> Thanks for your help & comments-
>>
>> Doug
>> -- 
>> View this message in context: http://www.nabble.com/Missing-pages---anchor-text-tf2179049.html#a6025652
>> Sent from the Nutch - Dev forum at Nabble.com.
>>
>>
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 101tec Inc.
> Menlo Park, California
> http://www.101tec.com
> 
> 
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Missing-pages---anchor-text-tf2179049.html#a6039836
Sent from the Nutch - Dev forum at Nabble.com.

