lucene-dev mailing list archives

From David Worms
Subject Re: Fw: LARM / Re: Avalonized WebCrawler
Date Fri, 31 Jan 2003 21:42:42 GMT

I forwarded this email to the Avalon-users list in the hope that they 
will correct or build on our discussion. I often talk about Merlin 
without deep knowledge of it.

> 1. I wonder how ...crawl.fetcher is working, since there seem to be some
> typos:
>   - DefaultFetcherTaskFacotry.xinfo (o<->t)
>     contains a reference to
>       com.celavi.crawl.fetcher.FetcherTaskFacotry
>     which doesn't exist

OK, you are right. It took me a while to understand what the .xinfo 
files do before I finally reached the conclusion: nothing, at least 
under Fortress. I started to learn Fortress with the crawler (I used 
Phoenix before) and worked from the examples in the Fortress CVS, which 
contain .xinfo files. But those files are only used by Phoenix 
(auto-generated via XDoclet) and Merlin.

> 2. Why crawl.Main and crawl.CrawlMain?

Here is the idea:
- crawl.Main is the entry point and a temporary hack. The "service" 
method, in which I manually initialize one component after the other, 
should not be there and will be removed at some point.
- crawl.CrawlMain has the ability to become a component of its own. Or 
maybe the term "block" is more appropriate. Both Phoenix and Merlin 
have this concept. A block can export different services. So our 
CrawlMain block initializes our (inner) crawl container (Merlin or 
Fortress) and makes its most relevant interface visible to an outer 
(LARM) container (Merlin, Fortress, or Phoenix).

> 3. Do you think dynamically configuring the whole pipeline from a config
> file would be possible? The contents of com.celavi.crawl.Main.service()
> should come from a config file, say pipeline_xy.xml (more than one pipeline
> config should be possible, say crawler_pipeline and indexer_pipeline).
> Depending on the contents of this file, another config file should contain
> the config values for each component (say crawl_full, crawl_incrementally)

Yes, that should be possible. The pipeline I created is a very simple 
one. It is easy to configure as long as each stage implements the same 
"MessageListener" interface (plus the additional lifecycle interfaces). 
You mention the ability to configure different pipelines. Interesting, 
I was just looking at this yesterday. I spent some time trying to find 
out what the hell the Excalibur "event" package could bring us. It 
promises a SEDA architecture, but I am not sure how it all works. I 
got some sample code @

> 4. What is your rule of thumb what becomes a component and what stays a
> class?

To me, a component is an instance that should be instantiated at 
application startup and be accessible to many other units 
(components). This is not how I would define a component in general, 
but it is the approach I took when I started to refactor code I didn't 
understand at the time (and there is still some stuff I am not familiar 
with).
I did not try to look at your code, take a breath, and see how I could 
decompose the system into components. Instead, I treat any object that 
is instantiated at application startup and destroyed at application 
shutdown as a candidate.
More or less, everything present in your "FetcherMain" object became a 

> 5. Why Fortress and not a different container (just curious, I don't have
> any preference)?

I learned Avalon with Phoenix first. Great, I love it. It is extremely 
easy to access Phoenix through AltRMI without a change in your code, 
and just as easy to configure your app with JMX. However, what if we 
want the crawler embedded inside another application? Phoenix can only 
run standalone. Here is where Merlin and Fortress can help. We can run 
our Fortress-based application from a main method, inside a servlet, or 
even better, inside Phoenix as a block. I chose Fortress over Merlin 
because it is closer to a release.

> 6. It appears to me that Fortress is creating proxy components that act as
> facades to the underlying component interfaces (am I right here?). This is
> exactly what I wanted to avoid. It simply becomes too heavy weighted (unless
> we use typical component patterns). Since we may well create 100,000
> URLMessages per second, it would kill us to send every call to
> urlMessageFactory.createURLMessage through a proxy. I wonder if the other
> available containers work the same way? (I know Phoenix doesn't do this)

I am not sure about this. Can someone help us? I think we should look 
at the component handlers (the lifestyles) in Fortress, in the 
org.apache.excalibur.fortress.handler package.
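
To make the concern measurable, here is a minimal sketch that compares a 
direct call with the same call routed through a JDK dynamic proxy. It 
uses java.lang.reflect.Proxy only as a stand-in: I am not claiming this 
is how Fortress wraps components. The URLMessageFactory interface, the 
String return type, and the class names are all hypothetical.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.concurrent.atomic.AtomicLong;

public class ProxyCostSketch {
    /** Hypothetical factory interface; the real one would return a URLMessage. */
    public interface URLMessageFactory {
        String createURLMessage(String url);
    }

    /** Plain implementation, called without any container in between. */
    public static final class DirectFactory implements URLMessageFactory {
        public String createURLMessage(String url) {
            return "msg:" + url;
        }
    }

    /** Wraps the target in a JDK dynamic proxy that counts and forwards calls. */
    public static URLMessageFactory proxied(URLMessageFactory target, AtomicLong calls) {
        InvocationHandler handler = (proxy, method, methodArgs) -> {
            calls.incrementAndGet();
            return method.invoke(target, methodArgs); // reflective dispatch
        };
        return (URLMessageFactory) Proxy.newProxyInstance(
                URLMessageFactory.class.getClassLoader(),
                new Class<?>[] { URLMessageFactory.class },
                handler);
    }

    /** Runs n calls and returns the elapsed time in nanoseconds. */
    static long time(URLMessageFactory factory, int n) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            sink += factory.createURLMessage("u").length();
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) System.out.println(sink); // keep the loop from being optimized away
        return elapsed;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        URLMessageFactory direct = new DirectFactory();
        URLMessageFactory viaProxy = proxied(direct, new AtomicLong());
        System.out.println("direct ns:  " + time(direct, n));
        System.out.println("proxied ns: " + time(viaProxy, n));
    }
}
```

The numbers depend heavily on the JVM and on warm-up, so this is only a 
rough way to check whether proxying every createURLMessage call would 
really hurt at 100,000 messages per second.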

> 7. As far as I can see, each MessageProcessor (State/MessageListener in your
> terms) adds _itself_ to a message handler that it has to know about (as
> defined in DefaultMessageListenerSelector.xinfo). Doesn't this violate the
> IoC pattern? Shouldn't an external component initialize the message handler
> with the listeners according to a defined order? (the order is at the moment
> given only implicitly by the order the config files are processed).

You are right. It is the logical move. At first, each stage registered 
itself with the MessageHandler. Then I introduced the 
MessageListenerSelector, which instantiates each stage and then 
registers it. Now the MessageHandler should register the stages itself 
by calling MessageListenerSelector.selectAll() during its own 
initialization.
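
A minimal sketch of that last step, in plain Java without any Avalon 
lifecycle interfaces. Only the names MessageHandler, MessageListener, 
MessageListenerSelector and selectAll() come from the crawler; the 
method signatures, the Message class, and the stage names are 
assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PipelineSketch {
    /** Hypothetical message type; trace records which stages saw it. */
    static final class Message {
        final String url;
        final List<String> trace = new ArrayList<>();
        Message(String url) { this.url = url; }
    }

    /** A pipeline stage. Returning null drops the message (assumed contract). */
    interface MessageListener {
        Message handle(Message message);
    }

    /** Knows all stages and their order; the stages know nothing about the handler. */
    interface MessageListenerSelector {
        List<MessageListener> selectAll();
    }

    /** Pulls its stages from the selector instead of being registered with. */
    static final class MessageHandler {
        private final MessageListenerSelector selector;
        private final List<MessageListener> stages = new ArrayList<>();

        MessageHandler(MessageListenerSelector selector) {
            this.selector = selector;
        }

        /** Stage order is exactly the order returned by selectAll(). */
        void initialize() {
            stages.addAll(selector.selectAll());
        }

        Message dispatch(Message message) {
            for (MessageListener stage : stages) {
                message = stage.handle(message);
                if (message == null) {
                    return null; // a stage filtered the message out
                }
            }
            return message;
        }
    }

    /** Builds a trivial stage that only records its name on the message. */
    static MessageListener named(String name) {
        return message -> {
            message.trace.add(name);
            return message;
        };
    }

    public static void main(String[] args) {
        MessageListenerSelector selector =
                () -> Arrays.asList(named("urlFilter"), named("fetcher"));
        MessageHandler handler = new MessageHandler(selector);
        handler.initialize();
        Message out = handler.dispatch(new Message("http://example.org/"));
        System.out.println(out.trace);
    }
}
```

With this shape the stage order is explicit in one place (the selector), 
rather than implied by the order in which config files are processed.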

Still trying to figure out a lot of stuff... I am really learning a lot 
from your code...

