nutch-dev mailing list archives

From Dennis Kubes <ku...@apache.org>
Subject Re: Ranking Algorithms
Date Mon, 18 May 2009 20:20:34 GMT
The answer is simple and not so simple at the same time.  Last year we 
put in quite a bit of work to implement a stable PageRank-like algorithm 
in Nutch.  This was released as the new scoring and indexing 
frameworks.  That gives a good general relevancy score, but it is really 
just a starting point.
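To make the link-analysis idea above concrete, here is a minimal sketch of the classic PageRank power iteration that such scoring frameworks are based on.  This is an illustrative toy, not Nutch's actual implementation; the damping factor, iteration count, and dangling-page handling are all assumptions on my part.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with a uniform distribution of rank.
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page gets the "random jump" share up front.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: redistribute its rank over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank
```

For example, `pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})` ranks "c" highest, since it is linked from both "a" and "b".  The ranks always sum to 1 because the dangling-page branch conserves the total rank mass.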

Many people look at search engines and see a single algorithm, such as 
PageRank.  In reality, a modern search engine, such as Google or Yahoo, 
will have hundreds of algorithms and jobs that contribute to the 
relevancy of search results.  This is because of two factors:

1) After getting good general relevancy (i.e. link analysis and such), 
search relevancy is about handling specific relevancy issues.  For 
example: handling reciprocal links, near-duplicate detection, 
organizations that own 100k domains, template pages, blogs and echo 
chambers, hacked pages and blogs with link and keyword spam, malware, 
etc.  Each of these types of issues, and there are many more, requires 
a specific algorithm to handle it.
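As an illustration of one issue from the list above, near-duplicate detection is often done by comparing word shingles between documents.  This is a hedged sketch under my own assumptions (shingle width 3, Jaccard threshold 0.8); production systems typically scale this up with MinHash or SimHash rather than comparing full shingle sets.

```python
def shingles(text, w=3):
    """Return the set of w-word shingles (overlapping word windows)."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity between the shingle sets of two texts."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def near_duplicates(text_a, text_b, threshold=0.8):
    """True if the two texts are likely near-duplicates."""
    return jaccard(text_a, text_b) >= threshold
```

Identical texts score 1.0 and unrelated texts score near 0.0; pages that share a boilerplate template but differ in body text land somewhere in between, which is exactly the gray zone these specialized algorithms have to handle.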

Google and Yahoo would have algorithms (and people who specialize in 
certain areas) to handle all of these types of issues, usually through 
statistical analysis and machine learning jobs.  These jobs would then 
be aggregated together (think pipeline) to form final search engine 
relevancy scores.
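The aggregation step described above can be sketched as a weighted combination of per-document signals emitted by the various offline jobs.  The signal names and weights here are purely illustrative assumptions; a real pipeline would learn the weights with machine-learned ranking rather than hand-tuning them.

```python
# Hypothetical signals produced by separate offline jobs.
SIGNAL_WEIGHTS = {
    "link_score": 0.5,    # output of the link-analysis job
    "spam_score": -0.3,   # penalty from a spam-classification job
    "dup_penalty": -0.2,  # penalty from near-duplicate detection
}

def aggregate(signals):
    """signals: dict of signal name -> value in [0, 1] for one document.

    Missing signals default to 0.0, so a document untouched by a job
    is simply unaffected by that job's weight.
    """
    return sum(SIGNAL_WEIGHTS[name] * signals.get(name, 0.0)
               for name in SIGNAL_WEIGHTS)
```

A clean, well-linked page (`link_score` 1.0, no penalties) scores 0.5, while the same page flagged by the spam job drops to 0.2, showing how each specialized job nudges the final relevancy score.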

In all fairness, this is offline relevancy.  There would also be a 
considerable amount of work done on query parsing and online relevancy.

2) Relevancy scores change over time due to people and companies 
attempting to manipulate search results through SEO (both good and bad), 
through culture in general, and through search engines developing 
better algorithms.

So this is a long way of explaining that while Nutch currently has, 
IMO, good general relevancy, taking it to the next level, to where 
results are "as good as Google's," is going to take many different 
specialized MapReduce jobs that we currently don't have.

Dennis

atencorps wrote:
> Nutch is a great search engine, and I was recently pleased when the large
> multinational I work for did some trials of Nutch vs. Google while we were
> evaluating enterprise search. I was glad to see Nutch was a worthy
> competitor; Google Enterprise was chosen only due to office politics
> (preferring a large company over a smaller one, etc.).
> 
> In terms of enterprise search I think Nutch already has it covered; my
> question is about Internet search.
> 
> PageRank has been around for over 10 years and is what built Google. Are
> there any newer, more capable ranking algorithms available? And is there
> any vision for implementing a truly worthy ranking algorithm in Nutch
> that could deliver quality Internet search results like Google?
