nutch-dev mailing list archives

From Andrzej Bialecki
Subject Re: New Extension Points?
Date Wed, 29 Jul 2009 09:51:34 GMT
Marko Bauhardt wrote:
> Hi,
> I know you are working on the new "plugin system", OSGi etc., but I want 
> to talk about new extension points.
> I think it would be helpful if we had, for example, the extension points 
> IPreCrawl and IPostCrawl. These extension points could be used to implement 
> some helpful jobs.
> For example, before starting a new crawl, one implementation of IPreCrawl 
> could
> + export urls from a "database" into a url file, to inject this file into 
> the crawldb
> + or create statistics.
> When a crawl is finished, one implementation of IPostCrawl could
> + restart search servers
> + switch index
> + create statistics from this complete crawl
> + or send an email or whatever to an administrator...

This looks to me less like an extension point and more like a 
notification system, e.g. JMS-based. Currently the execution of plugins 
in extension points is synchronous, i.e. the calling application will be 
blocked until the plugin completes its execution. Most likely you want 
an asynchronous execution here?
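
To make the distinction concrete, here is a minimal sketch in plain Java. 
CrawlListener and CrawlNotifier are hypothetical names used only for 
illustration; they are not existing Nutch classes, and a real notification 
system (JMS or otherwise) would look different. The sketch only shows that 
the same listeners can be invoked either in the caller's thread, as 
extension points are today, or handed off to another thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical post-crawl listener; an assumption, not part of the Nutch API. */
interface CrawlListener {
    void crawlFinished(String crawlId);
}

/** Hypothetical dispatcher that can call listeners synchronously or asynchronously. */
class CrawlNotifier {
    private final List<CrawlListener> listeners;
    private final ExecutorService pool = Executors.newSingleThreadExecutor();

    CrawlNotifier(List<CrawlListener> listeners) {
        this.listeners = new ArrayList<>(listeners);
    }

    /** Extension-point style: the caller blocks until every listener has returned. */
    void notifySync(String crawlId) {
        for (CrawlListener listener : listeners) {
            listener.crawlFinished(crawlId);
        }
    }

    /** Notification style: the caller returns at once; listeners run on the pool thread. */
    CompletableFuture<Void> notifyAsync(String crawlId) {
        return CompletableFuture.runAsync(() -> notifySync(crawlId), pool);
    }

    /** Stops accepting new notifications; already submitted ones still run. */
    void shutdown() {
        pool.shutdown();
    }
}
```

With notifySync a slow listener (say, one that restarts search servers) 
holds up the crawl tool; with notifyAsync the tool can carry on and, if it 
cares, wait on the returned future later.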

> Also, I think statistics of a segment or the crawldb are very important 
> to get an overview of the URL space. So maybe another extension point 
> (e.g. ISegmentStatistic) could be used to create statistics for every 
> segment after that segment is fetched.

I agree - segment parts are immutable, so once they are created their 
statistics are also immutable. It would make even more sense to collect 
such stats on-the-fly as each part is being created, and then write them 
out to a per-segment metadata file.
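
A rough sketch of what I mean, in plain Java. The SegmentStats class, the 
counter names and the stats.properties file name are all assumptions made 
up for this example; nothing like this exists in Nutch today, and a real 
version would have to be fed from the job that writes the part and would 
write to the Hadoop FileSystem rather than a local path.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Hypothetical on-the-fly statistics collector for one segment part. */
class SegmentStats {
    private final Map<String, Long> counters = new TreeMap<>();

    /** Called once per record while the part is being written. */
    void record(String status, long bytes) {
        counters.merge("status." + status, 1L, Long::sum);
        counters.merge("bytes.total", bytes, Long::sum);
    }

    /** Returns the current value of a counter, or 0 if it was never touched. */
    long get(String key) {
        return counters.getOrDefault(key, 0L);
    }

    /** Writes the counters once, after the part is complete (file name is an assumption). */
    void writeTo(Path segmentDir) throws IOException {
        Properties props = new Properties();
        counters.forEach((key, value) -> props.setProperty(key, Long.toString(value)));
        Path out = segmentDir.resolve("stats.properties");
        try (Writer writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            props.store(writer, "segment statistics");
        }
    }
}
```

Since the part never changes after it is written, the metadata file never 
needs to be recomputed; tools that want an overview can just read it.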

BTW: we really need to move away from using the name "segment", which is 
for many reasons confusing, and towards using the name "shard" which 
seems to be the commonly used name for this kind of data.

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
Contact: info at sigram dot com
