lucene-dev mailing list archives

From "Andrew C. Oliver" <>
Subject RE: Proposal for Lucene / new component
Date Sun, 03 Mar 2002 20:02:52 GMT
On Sat, 2002-03-02 at 19:10, Halácsy Péter wrote:
> > -----Original Message-----
> > From: Andrew C. Oliver []
> > Sent: Tuesday, February 26, 2002 2:13 PM
> > To: Lucene Developers List
> > Subject: Re: Proposal for Lucene / new component
> > 
> > 
> > Humm.  Well said.  I'm not against using Avalon.  My approach to
> > software is this, though: get a working draft, then refactor it into
> > something that will *stand the test of time* for your second or third
> > release.  Things change...iterate.  Not against a super-configurable
> > masterpiece...but first I want to crawl and index web pages over httpd
> > in various pluggable mime formats.  Once we get there...
> > 
> Hello,
> I was abroad last week, and it took me at least 30 minutes to read the discussion about
Avalon. It's great!
> Someone mentioned that Avalon is only used by Cocoon. Well, we are using Cocoon and I'm
very happy that it is Avalon based. I think that is the main reason for its flexibility. BTW, Cocoon
uses Lucene, pls refer to
> I think if you need logging, configuration, threading, and pooling (for the crawler) and want
to be component based, you need a framework, something like Avalon. It took one day to understand
Avalon and write the first Hello World application, but you can save a lot of time while coding.
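For anyone who hasn't seen it, the lifecycle-interface idea behind Avalon can be sketched in plain Java. The interface and class names below are illustrative only (loosely modeled on Avalon Framework, not its real API): the container hands each component its logger and configuration instead of the component building them itself.

```java
import java.util.Map;

// Sketch of the lifecycle-interface pattern: the container calls
// enableLogging() and configure() on a component before using it.
interface Loggable {
    void enableLogging(StringBuilder log);   // stand-in for a real Logger
}

interface Configurable {
    void configure(Map<String, String> config);
}

class HelloComponent implements Loggable, Configurable {
    private StringBuilder log;
    private String greeting = "Hello";

    public void enableLogging(StringBuilder log) { this.log = log; }

    public void configure(Map<String, String> config) {
        greeting = config.getOrDefault("greeting", greeting);
    }

    public String greet(String who) {
        log.append("greet called\n");        // component logs via what it was given
        return greeting + ", " + who + "!";
    }
}

public class HelloContainer {
    public static void main(String[] args) {
        // The "container" wires in logging and configuration.
        HelloComponent c = new HelloComponent();
        StringBuilder log = new StringBuilder();
        c.enableLogging(log);
        c.configure(Map.of("greeting", "Hello Avalon"));
        System.out.println(c.greet("world"));   // prints "Hello Avalon, world!"
    }
}
```

The point is only the inversion: the component never news up its own logger or reads its own config file, so the container can swap either one.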

Great!  Can you post your work on the Hello Avalon app somewhere?
If you could document along those lines as well, then I'll be happy to go
and write a "getting started" guide for Avalon.

I'm not objecting to using Avalon provided I can actually understand
it.  I'm really close thanks to the fine work of Ken Barrozzi
(, but I'm one step away from actually being able to start using
Avalon.  It's not an "I won't" issue, it's an "I can't" issue.

> Iteration is very good practice in software development and can be applied to Avalon
based applications as well. First you should only write interfaces. At first you can implement
a fake component that works like the real one. After a while you can swap in the working component
by rewriting the config file.
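That interface-first iteration might look like this (all names made up for illustration): define the component contract, ship a fake first, and replace it with the real implementation later without touching callers.

```java
// Interface-first iteration: callers depend only on Fetcher,
// so the fake can be replaced by a real HTTP implementation later.
interface Fetcher {
    String fetch(String url);
}

// First iteration: a fake that behaves like the real component would.
class FakeFetcher implements Fetcher {
    public String fetch(String url) {
        return "<html>stub page for " + url + "</html>";
    }
}

// The crawler never learns which implementation it was given.
class Crawler {
    private final Fetcher fetcher;
    Crawler(Fetcher fetcher) { this.fetcher = fetcher; }

    public int pageLength(String url) {
        return fetcher.fetch(url).length();
    }
}

public class IterationDemo {
    public static void main(String[] args) {
        // Later, swap FakeFetcher for an HttpFetcher in config -- Crawler is unchanged.
        Crawler crawler = new Crawler(new FakeFetcher());
        System.out.println(crawler.pageLength("http://example.org/"));
    }
}
```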

I kinda believe in writing components that work or do something useful
early on.

> For example, I think the http crawler is built from more than one component:
> 1. the fetcher that connects to the webserver and gets the page from the url
> responsible for: downloading the page as is (handling network errors), handling HTTP
status codes (for example redirects)
> configurable by: proxy server, max open sockets
> 2. a component that parses the fetched page and extracts relevant metadata
> 3. a component that is an interface to the loader; it gets the fetched and parsed pages
from the parser (or gets a command from the fetcher to delete pages from the search database)
> this interface can be implemented in several components:
> one that puts the data in files (if the loader and the search db are on another box)
> one that gives the data to the loader component (that is in the same JVM)
> and so on
> 4. one that feeds urls to the crawler's database
> responsible for:
> extracting links from the downloaded pages
> handling manually submitted urls (submitted by users or sysadmins)
> filtering out the excluded urls
> configurable by: excluding rules

awesome, can you patch the proposal with how you propose to do that?

> 5. one that reads urls from the database and feeds them to the fetcher
> the most sophisticated component, responsible for:
> choosing the right url to crawl:
>  - it can use a priority list based on url patterns
>  - do not fetch a lot of pages from the same server (max 1 request/min)
>  - respect the robots.txt file
> configurable by: priority lists, max urls from a host
> 6. and the last component is the database itself; it can be a JDBC compliant database
or something file system based
> responsible for: adding/deleting urls to/from the database (url: last fetched date, last
HTTP status code, last action [add or delete])
> answering host related questions: how many urls were fetched from the host, what time
the last url was fetched, the robots.txt of the host
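Component 5's "max 1 request/min per host" rule is easy to sketch. The class below is a hypothetical illustration, not code from any crawler; the current time is passed in so the logic is testable.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-host politeness gate: a URL may be fetched only if
// at least minIntervalMillis have passed since the host was last hit.
class PolitenessGate {
    private final long minIntervalMillis;
    private final Map<String, Long> lastFetch = new HashMap<>();

    PolitenessGate(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /** Returns true (and records the fetch) if the host may be crawled now. */
    synchronized boolean tryAcquire(String url, long nowMillis) {
        String host = URI.create(url).getHost();
        Long last = lastFetch.get(host);
        if (last != null && nowMillis - last < minIntervalMillis) {
            return false;                // too soon; the scheduler picks another host's url
        }
        lastFetch.put(host, nowMillis);
        return true;
    }
}
```

The scheduler would call `tryAcquire` before handing a URL to the fetcher and requeue the URL when it returns false.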
> I know it's not a model of a working http crawler, but please notice:
> 1. using avalon you can change the implementation of a component in 30 seconds (if someone
has implemented it ;)
> 2. you don't have to work on implementing logging, a configuration system, or database pooling
for JDBC
> 3. the crawler is a component that needs no information about the search database (and
the loader/indexer doesn't know about the crawler)
> 4. the parser and loader interface components can be used in a file based HTML crawler (one that
reads static HTML pages from the directory of the webserver [if the engine is used on an intranet])
> 5. having different loader components you can build a search engine for a single JVM or
for a distributed system (and you do not need to implement that in the first iteration cycle)
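Point 1's "change the implementation in the config file" boils down to loading a class by name. A bare-bones version of what a container does, with no Avalon dependency (all names hypothetical):

```java
import java.util.Properties;

// Mechanically, config-driven swapping means: read a class name from
// configuration and instantiate it reflectively, so no caller code
// changes when the configured class name changes.
interface Greeter {
    String greet();
}

class EnglishGreeter implements Greeter {
    public String greet() { return "hello"; }
}

class GermanGreeter implements Greeter {
    public String greet() { return "hallo"; }
}

public class TinyContainer {
    static Greeter lookup(Properties config) throws Exception {
        String className = config.getProperty("greeter.class");
        return (Greeter) Class.forName(className)
                              .getDeclaredConstructor()
                              .newInstance();
    }

    public static void main(String[] args) throws Exception {
        Properties config = new Properties();
        config.setProperty("greeter.class", "GermanGreeter"); // the "30 second" edit
        System.out.println(lookup(config).greet());           // prints "hallo"
    }
}
```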
> OK, this mail is already too long and I'm tired.
> peter

Cool.  My only problem is that if I'm to participate in development
involving Avalon, I must understand Avalon.  Some folks have already
written/donated some tremendous code that does some of these things.
I'd like to reuse this code -- I'm happy to help refactor it to
Avalon...but it goes back to #1.

Anyhow, maybe I'm just not skilled enough to grasp Avalon (I've thought
it was just a poor-documentation issue).  If that prevents me from
contributing to this effort in a meaningful way, then no big deal.  My
goal is to help facilitate the work in any way I can.  If that means
Avalon, fine, but up until now I've mostly failed to get it.  If you're
able, then how about getting us started with some Avalon-esque



-- - port of Excel/Word/OLE 2 Compound Document 
                            format to java 
			- fix java generics!
The avalanche has already started. It is too late for the pebbles to
vote.
-Ambassador Kosh

To unsubscribe, e-mail:   <>
For additional commands, e-mail: <>
