nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Per-page crawling policy
Date Fri, 06 Jan 2006 20:41:37 GMT
Jack Tang wrote:

>Hi Andrzej
>
>The idea brings vertical search into nutch and definitely it is great:)
>I think nutch should add an information retrieval layer to the whole
>architecture, and export some abstract interfaces, say
>UrlBasedInformationRetrieve (you can implement your url grouping idea
>here?), TextBasedInformationRetrieve, DomBasedInformationRetrieve.
>Users can implement these in their vertical search on their own.
>  
>

We more or less reached an agreement to add Properties to CrawlDatum.
Users will be able to put arbitrary metadata in there, so that each page
record can be processed differently if need be.
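
To illustrate the idea, here is a minimal sketch of per-page metadata driving a per-page policy. It is not the actual CrawlDatum API: the PageRecord class, the fetchIntervalDays helper, and the "fetch.interval.days" key are all hypothetical names made up for this example. Only java.util.Properties is real.

```java
import java.util.Properties;

// Hypothetical sketch; PageRecord stands in for a crawl record and is
// NOT the real Nutch CrawlDatum class.
public class PageRecordSketch {

    public static final class PageRecord {
        public final String url;
        // Arbitrary per-page metadata, as proposed in this thread.
        public final Properties metadata = new Properties();

        public PageRecord(String url) {
            this.url = url;
        }
    }

    // Hypothetical policy lookup: read the refetch interval from the page's
    // metadata, falling back to the given default when the key is absent.
    public static int fetchIntervalDays(PageRecord page, int defaultDays) {
        String value = page.metadata.getProperty("fetch.interval.days");
        return value == null ? defaultDays : Integer.parseInt(value);
    }

    public static void main(String[] args) {
        PageRecord news = new PageRecord("http://example.com/news");
        news.metadata.setProperty("fetch.interval.days", "1");

        PageRecord archive = new PageRecord("http://example.com/archive");

        // The news page carries its own interval; the archive page uses the default.
        System.out.println(news.url + " -> " + fetchIntervalDays(news, 30));
        System.out.println(archive.url + " -> " + fetchIntervalDays(archive, 30));
    }
}
```

The point of keeping the metadata as plain key/value pairs is that a plugin can attach whatever it needs to a page without the record class having to know about it in advance.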

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


