nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Per-page crawling policy
Date Thu, 05 Jan 2006 17:41:57 GMT
Doug Cutting wrote:

> Stefan Groschupf wrote:
>
>> Before we start adding meta data and more meta data, why not add
>> generic meta data to the CrawlDatum once, so that we can then have any
>> kind of plugin that adds and processes metadata that belongs to a URL?
>
>
> +1
>
> This feature strikes me as something that might prove very useful, but 
> might also prove unworkable, or at least not useful to everyone.  Thus 
> it would be best if it doesn't require changes to a core class like 
> CrawlDatum.  If it does eventually prove generally useful, as 
> something that everyone will use and that should be enabled by 
> default, then we could promote its data from metadata to a field for 
> efficiency.
>
> In this vein, should modifiedTime be moved to metadata, once metadata 
> is added?


I'm of a split mind on this, because I hope that the detection of 
unmodified content will be the default mode of operation... OTOH, 
perhaps it's a premature micro-optimization. We can move it to metadata 
for now, but I see it as a strong candidate to be moved back...
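
To make the trade-off concrete, here is a minimal Java sketch of what keeping modifiedTime in generic metadata could look like. It is only an illustration under stated assumptions: the class DatumSketch, the key "modifiedTime", and all method names are hypothetical and are not CrawlDatum's actual API. The point is that every read goes through a map lookup and a parse, which is the cost a dedicated field would avoid if unmodified-content detection became the default.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a crawl record; NOT Nutch's actual CrawlDatum. */
class DatumSketch {
    /** Hypothetical metadata key for the last-modified timestamp. */
    static final String MODIFIED_TIME_KEY = "modifiedTime";

    private final String url;

    // Generic per-URL metadata that plugins could read and write.
    private final Map<String, String> metadata = new HashMap<>();

    DatumSketch(String url) {
        this.url = url;
    }

    String getUrl() {
        return url;
    }

    void putMeta(String key, String value) {
        metadata.put(key, value);
    }

    String getMeta(String key) {
        return metadata.get(key);
    }

    // modifiedTime lives in metadata here, so each write stringifies
    // the value and each read does a lookup plus a parse.
    void setModifiedTime(long millis) {
        putMeta(MODIFIED_TIME_KEY, Long.toString(millis));
    }

    /** Returns the recorded time in milliseconds, or 0 if none is recorded. */
    long getModifiedTime() {
        String value = getMeta(MODIFIED_TIME_KEY);
        return value == null ? 0L : Long.parseLong(value);
    }

    /** True if a time is recorded and the fetched page is no newer than it. */
    boolean isUnmodified(long fetchedLastModified) {
        long known = getModifiedTime();
        return known != 0L && fetchedLastModified <= known;
    }
}
```

Promoting the value back to a field later would only change the bodies of setModifiedTime and getModifiedTime in this sketch; callers such as isUnmodified would stay the same.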

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


