nutch-dev mailing list archives

From "Dennis Kubes (JIRA)" <>
Subject [jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.
Date Fri, 09 Nov 2007 15:26:51 GMT


Dennis Kubes commented on NUTCH-574:

I agree, refactoring the code into a plugin is a better solution, and I have started down
that path.  One open issue is how we do the matching.

The initial problem is that we are indexing anchor words for pages that don't contain those
words, but for words that are contained in the page we still want the boost factor.  So as I
see it there are two options.

1) We can be strict and say that if an inbound link contains *any* anchor text that is not
currently in the page, then we don't index that link's anchor text at all.
2) We can manipulate the text of the anchor, removing any words in the anchor text that do
not appear in the page, and in effect not index those words.

I am leaning toward the second option of indexing all links but removing words.  I think it
is likely that some of the words in a link will be on the page and some will not, and we
want to include those that are and exclude those that are not.  Would like opinions on this.
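
To make the two options concrete, here is a minimal sketch of the matching step.  It is
only an illustration: the class AnchorTextFilter and its methods are hypothetical names,
not existing Nutch code, and the tokenizer (lowercase, split on anything that is not a
letter or digit) is an assumption standing in for whatever analysis the indexer really uses.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: illustrates the two matching options.
final class AnchorTextFilter {

    // Assumed tokenizer: lowercase, then split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Option 1 (strict): the anchor is usable only if every one of its words is on the page.
    static boolean allWordsOnPage(String anchor, String pageText) {
        Set<String> pageWords = new HashSet<>(tokenize(pageText));
        return pageWords.containsAll(tokenize(anchor));
    }

    // Option 2: drop the anchor words that are missing from the page, keep the rest in order.
    static String keepWordsOnPage(String anchor, String pageText) {
        Set<String> pageWords = new HashSet<>(tokenize(pageText));
        List<String> kept = new ArrayList<>();
        for (String word : tokenize(anchor)) {
            if (pageWords.contains(word)) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }
}
```

For a page whose text is "Hotels in Austin and Houston" and an inlink with the anchor
"dallas hotels", allWordsOnPage rejects the anchor outright, while keepWordsOnPage reduces
it to "hotels".  Under option 2 the page keeps the anchor boost for "hotels" but is never
indexed for "dallas".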

> Including inlink anchor text in index can create irrelevant search results.
> ---------------------------------------------------------------------------
>                 Key: NUTCH-574
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>         Environment: All, basic indexing filter
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>         Attachments: NUTCH-574-1.patch
> Currently the basic indexing filter includes inbound anchor text for a given URL in the
> index.  This sometimes allows pages to show up in search results where they may not be relevant.
> An example of this is a search for "dallas hotels" in our production index (
> Google would show up first in this example although there is no text matching either dallas
> or hotels on the google home page.  What is happening here is there are inlinks into google
> with the words dallas and hotels which get included in the index for and because
> google would have a very high boost due to inlinks, google shows up first for these search
> terms.  I propose we add an option to allow/prevent inlink anchor text from being included
> in the index and set the default for this option to NOT include inbound link anchor text.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
