nutch-dev mailing list archives

From Stefan Groschupf ...@media-style.com>
Subject Re: New nutch plugin
Date Wed, 30 Mar 2005 14:24:27 GMT
Matthias,

> maybe you want to build an advanced local search plugin:
> http://cis.poly.edu/tr/tr-cis-2005-03.pdf could be a rich source of
> information on how to start.
>
Very interesting article!
From my point of view, the 'geo coding' via gazetteer isn't that
difficult; it is just named entity extraction (which our ie-lib handles
quite well :-] ).
The results then need to be looked up to transform them into geo
coordinates. All this can be done in an index filter plugin.
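
A minimal sketch of that lookup step, only to illustrate the idea. It
does not use ie-lib or any Nutch plugin interface: the class name
GazetteerSketch, the geocode method and the two-entry table (with
approximate coordinates) are all hypothetical, and a plain token match
stands in for real named entity extraction.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class GazetteerSketch {
    // Hypothetical toy gazetteer: lower-cased place name -> {latitude, longitude}.
    // The coordinates are approximate and only serve as illustration.
    static final Map<String, double[]> GAZETTEER = new HashMap<>();
    static {
        GAZETTEER.put("berlin", new double[] {52.52, 13.405});
        GAZETTEER.put("hamburg", new double[] {53.55, 9.99});
    }

    // Splits the text on non-letter characters and returns the coordinates
    // of every token found in the gazetteer, in order of appearance.
    static List<double[]> geocode(String text) {
        List<double[]> hits = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            double[] coords = GAZETTEER.get(token);
            if (coords != null) {
                hits.add(coords);
            }
        }
        return hits;
    }
}
```

In a real index filter the returned coordinates would presumably be
written into index fields; how exactly depends on the plugin API and is
left out here.
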
However, geo coding based on incoming links is the most interesting
but also the most difficult job.
The problem with nutch is that we have no way to add meta data to the
web db. This is one of the features I really would love to see.
Caching such meta data elsewhere, e.g. in a database, does not scale
and slows things down very much.
I discussed this with Doug (at the OS Wizard 2004 conference) and I
clearly understand that this feature is very difficult and would
dramatically slow down the webdb.
However, I strongly believe that the possibility to add meta data to
the web db would be a major step forward.
Besides geo coding based on the geo position of incoming links, we
could use meta data for tracking the update intervals of web pages for
better fetch lists, and for a set of other great functionality.
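
To make the incoming-links idea a bit more concrete, here is a rough
sketch under my own assumptions. The in-memory map below is only a
stand-in for per-page meta data in the web db (which does not exist
today), the keys "geo.lat" and "geo.lon" are made up, and taking the
plain mean of the inlink positions is just one simple choice, not
necessarily what the article proposes. A plain mean also breaks down
near the 180 degree meridian.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InlinkGeoSketch {
    // Hypothetical per-page meta data store: URL -> (key -> value).
    static final Map<String, Map<String, String>> META = new HashMap<>();

    // Attaches one meta data entry to a page.
    static void put(String url, String key, String value) {
        META.computeIfAbsent(url, u -> new HashMap<>()).put(key, value);
    }

    // Averages the positions of all inlinking pages that carry geo meta
    // data. Returns {lat, lon}, or null if no inlink has a known position.
    static double[] estimateFromInlinks(List<String> inlinkUrls) {
        double lat = 0;
        double lon = 0;
        int known = 0;
        for (String url : inlinkUrls) {
            Map<String, String> meta = META.get(url);
            if (meta == null || !meta.containsKey("geo.lat") || !meta.containsKey("geo.lon")) {
                continue; // no geo position recorded for this inlink
            }
            lat += Double.parseDouble(meta.get("geo.lat"));
            lon += Double.parseDouble(meta.get("geo.lon"));
            known++;
        }
        return known == 0 ? null : new double[] {lat / known, lon / known};
    }
}
```

The same kind of key/value meta data could hold a last-changed
timestamp per page, which is what tracking update intervals would need.
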

Maybe with the map reduce port we can introduce flexible meta data to
the web db, just as it is flexible in the index today.

Stefan 

