nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <>
Subject [jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.
Date Sun, 11 Nov 2007 19:33:50 GMT


Andrzej Bialecki  commented on NUTCH-574:

I don't rule it out - I support the patch as is, i.e. separating the anchor indexing from
index-basic. My point was that anchor text is a complicated issue, and how you use anchor
text depends on your requirements - in other words, I think it may be difficult to find a
more advanced solution that would satisfy most users.

Some comments to the latest patch:

* I think it would be good to put a NOTE: in CHANGES.txt that reminds users who wish to keep
the current behavior that they should make sure that their nutch-default.xml / nutch-site.xml
contains this plugin in plugin.includes.

* There are literal tab characters in plugin/build.xml - they should be converted to spaces.
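
To illustrate the first point, here is a minimal sketch of how one might check whether a
given plugin.includes value would still load the new plugin. It assumes the value is a
regular expression matched against the whole plugin id. The plugin id "index-anchor" and
both sample values are hypothetical, chosen only for illustration; they are not taken from
the patch.

```python
import re

# Hypothetical plugin id for the separated anchor indexing filter (assumption).
ANCHOR_PLUGIN_ID = "index-anchor"


def plugin_enabled(plugin_includes: str, plugin_id: str) -> bool:
    """Return True if the whole plugin id matches the plugin.includes regex."""
    return re.fullmatch(plugin_includes, plugin_id) is not None


# Shortened, made-up plugin.includes values (assumptions, not real defaults).
old_value = "protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)"
new_value = "protocol-http|parse-(text|html)|index-(basic|anchor)|query-(basic|site|url)"

if __name__ == "__main__":
    for name, value in (("old", old_value), ("new", new_value)):
        state = "enabled" if plugin_enabled(value, ANCHOR_PLUGIN_ID) else "NOT enabled"
        print(f"{name} value: {ANCHOR_PLUGIN_ID} is {state}")
```

Under these assumptions, a site that carries over the old value would silently stop indexing
anchor text, which is why a NOTE in CHANGES.txt seems worthwhile.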

Other than that I think the patch can be applied as is, and we should continue the discussion

> Including inlink anchor text in index can create irrelevant search results.
> ---------------------------------------------------------------------------
>                 Key: NUTCH-574
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>         Environment: All, basic indexing filter
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>         Attachments: NUTCH-574-1.patch, NUTCH-574-2.patch, NUTCH-574-3.patch
> Currently the basic indexing filter includes inbound anchor text for a given URL in the
> index.  This sometimes allows pages to show up in search results where they may not be relevant.
> An example of this is a search for "dallas hotels" in our production index (
> Google would show up first in this example although there is no text matching either dallas
> or hotels on the google home page.  What is happening here is there are inlinks into google
> with the words dallas and hotels which get included in the index for and because
> google would have a very high boost due to inlinks, google shows up first for these search
> terms.  I propose we add an option to allow/prevent inlink anchor text from being included
> in the index and set the default for this option to NOT include inbound link anchor text.
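
The effect described in the issue can be sketched with a toy ranking model. This is not
Nutch or Lucene scoring: the Page class, the score formula (boost times the number of
matched query terms), the URLs and the boost values are all assumptions made up for
illustration.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    content: set    # terms that appear on the page itself
    anchors: set    # terms from inbound link anchor text
    boost: float    # stand-in for a link-based document boost


def score(page, terms, include_anchors):
    """Toy score: boost multiplied by the number of query terms found."""
    indexed = (page.content | page.anchors) if include_anchors else page.content
    return page.boost * sum(1 for t in terms if t in indexed)


def search(pages, query, include_anchors):
    """Return URLs of matching pages, highest toy score first."""
    terms = query.lower().split()
    scored = [(score(p, terms, include_anchors), p.url) for p in pages]
    hits = [(s, u) for s, u in scored if s > 0]
    return [u for s, u in sorted(hits, key=lambda hit: -hit[0])]


# Hypothetical pages: a heavily linked portal and a small, relevant hotel site.
pages = [
    Page("http://portal.example/", {"web", "search"}, {"dallas", "hotels"}, 10.0),
    Page("http://hotels.example/dallas", {"dallas", "hotels", "rooms"}, set(), 1.0),
]

if __name__ == "__main__":
    print(search(pages, "dallas hotels", include_anchors=True))
    print(search(pages, "dallas hotels", include_anchors=False))
```

In this toy model the portal outranks the hotel site only when anchor terms are indexed;
with anchors excluded it does not match the query at all.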

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
