nutch-dev mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.
Date Fri, 09 Nov 2007 15:54:50 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541359 ]

Enis Soztutar commented on NUTCH-574:
-------------------------------------

Honestly, I don't think that refusing to index anchor words that do not appear in the web site text
is a wise solution. What made Google so successful is indexing anchor text plus PageRank (PR). The classic
example is that the page http://www.honda.com/ never mentions that Honda is a car manufacturer,
but the anchor text does.

That said, I think we should focus on finding a way to eliminate the noise on anchor text.
At this point we take the first 10K links and discard the others, due to size constraints.
But a better way would be to select the best ones, or select the most frequent words, etc.
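
To make the "most frequent words" idea concrete, here is a rough sketch in plain Java. It is
not Nutch code: the class AnchorTermSelector and the method topTerms are hypothetical names
made up for illustration, and tokenizing on whitespace is an assumption. It counts terms over
all inlink anchors and keeps only the k most common ones, instead of keeping whichever links
happen to come first.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Nutch.
class AnchorTermSelector {

    // Returns the k most frequent lower-cased terms across all anchors.
    // Terms with equal counts are ordered alphabetically so the result is stable.
    static List<String> topTerms(List<String> anchors, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String anchor : anchors) {
            // Assumption: simple whitespace tokenization is good enough for a sketch.
            for (String term : anchor.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!term.isEmpty()) {
                    counts.merge(term, 1, Integer::sum);
                }
            }
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

With something like this, a stray "dallas hotels" anchor pointing at a heavily linked page would
only survive if those words were among the page's most common anchor terms. The cutoff k and
any stop-word handling would still need tuning.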





> Including inlink anchor text in index can create irrelevant search results.
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-574
>                 URL: https://issues.apache.org/jira/browse/NUTCH-574
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>         Environment: All, basic indexing filter
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-574-1.patch
>
>
> Currently the basic indexing filter includes inbound anchor text for a given URL in the
> index.  This sometimes allows pages to show up in search results where they may not be relevant.
> An example of this is a search for "dallas hotels" in our production index (www.visvo.com).
> Google would show up first in this example although there is no text matching either "dallas"
> or "hotels" on the Google home page.  What is happening here is that there are inlinks into Google
> with the words "dallas" and "hotels", which get included in the index for google.com, and because
> Google has a very high boost due to inlinks, Google shows up first for these search
> terms.  I propose we add an option to allow/prevent inlink anchor text from being included
> in the index and set the default for this option to NOT include inbound link anchor text.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

