lucene-dev mailing list archives

From "Clemens Marschner" <>
Subject Re: LARM Crawler: Status // Avalon?
Date Wed, 19 Jun 2002 20:36:28 GMT
> If you are interested, I can send you a class that is written as a
> NekoHTML Filter, which I use for extracting title, body, meta keywords
> and description.

Sure, send it over. But isn't the example packaged with Lucene doing the

> Have I mentioned framework here before?
> I read about it in JavaPro a few months ago and chose it for an
> application that I was/am writing.  It allows for a very elegant and
> simple (in terms of use) producer/consumer pipeline.
> I've actually added a bit of functionality to the version that's at
> and sent it to the author who will, I believe, include it in
> the new version.
> Also, the framework allows for distributed consumer pipeline with
> different communication protocols (JMS, RMI, BEEP...).  That is
> something that is not available yet, but the author told me about it
> over a month ago.

Hmm.. I'll have a look at it. But keep in mind that the current solution is
working already, and we probably only need one very simple way to transfer
the data.

> We want this whole pipeline to be
> > configurable
> > (remember, most of it is still done from within the source code).
> stuff doesn't have anything that allows for dynamic
> configurations, but it may be good to use because then you don't have
> to worry about developing, maintaining, fixing yet another component,
> which should really be just another piece of your infrastructure on top
> of which you can construct your specific application logic.

Yep, right. That's what I hate about C++ programs (also called
'yet-another-linked-list-implementation's :-)). I'll have a look at it; I
just think the patterns used in LARM are probably too simple to be worth the
exchange. But I'll see.

By the way, I thought about the "putting all together in config files"
thing: It's probably sufficient to have a couple of applications (main
classes) that put the basic stuff together, and whose parts are then
configurable through property files. At least for now.
It's just a feeling, but I fear some things could become very nasty if
we have to invent a declarative configuration language that describes the
configuration of the pipelines, or at least one whose components tell the
configuring class which other components they need to know of... (oh, that
looks like we need component-based development...)
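
To make the "main class plus property file" idea concrete, here is a rough
sketch using only java.util.Properties. The class name and the property keys
(fetcher.threads, start.url, index.dir) are made up for illustration and are
not taken from the LARM sources; the defaults are assumptions as well.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch: a main class wires the basic parts together in code
// and reads only the tunable values from a property file.
public class CrawlerConfigSketch {
    // Illustrative settings; names and defaults are assumptions, not LARM's.
    final int fetcherThreads;
    final String startUrl;
    final String indexDir;

    CrawlerConfigSketch(Properties p) {
        // Every key falls back to a default, so a partial file still works.
        fetcherThreads = Integer.parseInt(p.getProperty("fetcher.threads", "10"));
        startUrl = p.getProperty("start.url", "http://localhost/");
        indexDir = p.getProperty("index.dir", "index");
    }

    // Parses property-file text; a real main class would read a file instead.
    static CrawlerConfigSketch parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new CrawlerConfigSketch(p);
    }

    public static void main(String[] args) {
        CrawlerConfigSketch c = parse("fetcher.threads=25\nindex.dir=/tmp/larm-index\n");
        System.out.println(c.fetcherThreads + " threads, start at " + c.startUrl
                + ", index in " + c.indexDir);
    }
}
```

The pipeline layout itself stays in the main class; only the values that
change between installations move to the file, which avoids the declarative
configuration language for the moment.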

>> Lots of open questions:
>> - LARM doesn't have the notion of closing everything down. What
>> happens if IndexWriter is interrupted?

I must add that in general I don't have experience with using Lucene
incrementally, that is, updating the index while others are using it. Is
that working smoothly?

>As in what if it encounters an exception (e.g. somebody removes the
>index directory)?  I guess one of the items that should then maybe get
>added to the to-do list is checkpointing for starters.

Hm... what do you mean...?
From what I understand you mean that then the doc is stored in a repository
until the index is available again...? [confused]

One last thought:
- the crawler should be started as a daemon process (at least optionally)
- it should wake up from time to time to crawl changed pages
- it should provide a management and status interface to the outside
- it internally needs the ability to run service jobs while crawling
(keeping memory tidy, collecting stats, etc.)

From what I know, these matters could be addressed by the Apache
project. Does anyone know anything about it?
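
Independent of any framework, the four points above could be sketched with
plain JDK scheduling (java.util.concurrent). This is only an illustration:
the class and method names are hypothetical, the "crawl" and "service job"
bodies are just counters, and the status interface is a single string.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a crawler daemon: periodic re-crawls, background
// service jobs, and a minimal status view. Not LARM code.
public class CrawlerDaemonSketch {
    final AtomicInteger crawlRuns = new AtomicInteger();
    final AtomicInteger serviceRuns = new AtomicInteger();
    private final ScheduledExecutorService scheduler =
            Executors.newScheduledThreadPool(2);

    // Starts both periodic jobs; periods are in milliseconds.
    void start(long crawlPeriodMs, long servicePeriodMs) {
        // Wake up from time to time to crawl changed pages (placeholder body).
        scheduler.scheduleWithFixedDelay(
                () -> crawlRuns.incrementAndGet(),
                0, crawlPeriodMs, TimeUnit.MILLISECONDS);
        // Service job running alongside the crawl (stats, memory housekeeping).
        scheduler.scheduleAtFixedRate(
                () -> serviceRuns.incrementAndGet(),
                servicePeriodMs, servicePeriodMs, TimeUnit.MILLISECONDS);
    }

    // Minimal stand-in for a management and status interface.
    String status() {
        return "crawls=" + crawlRuns.get() + " serviceJobs=" + serviceRuns.get();
    }

    // Lets the daemon run for a while, then shuts it down in an orderly way.
    // Returns true if all jobs stopped within the timeout.
    boolean runFor(long millis) {
        try {
            Thread.sleep(millis);
            scheduler.shutdown(); // cancels the periodic jobs
            return scheduler.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            scheduler.shutdownNow();
            return false;
        }
    }

    public static void main(String[] args) {
        CrawlerDaemonSketch daemon = new CrawlerDaemonSketch();
        daemon.start(20, 10);
        boolean clean = daemon.runFor(300);
        System.out.println(daemon.status() + " cleanShutdown=" + clean);
    }
}
```

The orderly shutdown in runFor also touches the earlier open question about
closing everything down: whatever owns the index would need a similar hook so
that it is closed before the process exits.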

