lucene-dev mailing list archives
From Erik Hatcher <>
Subject Re: Lucene crawler plan
Date Tue, 01 Jul 2003 12:54:02 GMT
On Monday, June 30, 2003, at 10:21  PM, Peter Becker wrote:
> this is far closer to what we are looking for. Using Ant is an 
> interesting idea, although it probably won't help us for the UI tool. 
> But we could try to layer things so we could use them for both

Yes, I'm sure a more generalized method could be developed that 
accommodates both.  It's pretty decoupled even within the Ant project, 
with a DocumentHandler interface and all.

> Two differences between the Ant project and what we do right now:
> - the Ant project doesn't have a notion of an explicit file filter. I 
> think this is important if you want to extend the filter options to 
> more than just extensions and if you want some UI to manage the filter 
> mappings. BTW: does anyone know of a Java implementation for file(1) 
> magic?

Ah, but Ant *does* have more sophisticated filtering mechanisms!  :)  
The <fileset>s that the <index> task accepts can use any of Ant's 
built-in capabilities, such as Selectors (new in Ant 1.5).  So you 
could easily filter on file size, file date, etc., and custom Selectors 
can be written and plugged in.
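
To give a feel for what a custom selector amounts to, here is a minimal sketch, not the real thing.  It declares a local stand-in interface so it compiles without Ant on the classpath; the method shape is modeled on Ant's selector contract but should be treated as an assumption and checked against the Ant docs.  The names FileSelectorLike and MaxSizeSelector are hypothetical.

```java
import java.io.File;
import java.util.Locale;

// Local stand-in so this sketch compiles without Ant on the classpath.
// It is modeled on Ant's selector contract; the exact shape is an assumption.
interface FileSelectorLike {
    boolean isSelected(File basedir, String filename, File file);
}

/**
 * Hypothetical selector: accepts regular files that have the given
 * extension and are no larger than maxBytes.
 */
final class MaxSizeSelector implements FileSelectorLike {
    private final String extension;
    private final long maxBytes;

    MaxSizeSelector(String extension, long maxBytes) {
        this.extension = extension.toLowerCase(Locale.ROOT);
        this.maxBytes = maxBytes;
    }

    @Override
    public boolean isSelected(File basedir, String filename, File file) {
        // Combine a name-based test with a size-based test, which a plain
        // extension mapping could not express on its own.
        return file.isFile()
                && filename.toLowerCase(Locale.ROOT).endsWith(extension)
                && file.length() <= maxBytes;
    }
}
```

A selector written against Ant itself would be plugged into the <fileset> rather than called by hand; the point here is only that the filter is a small, replaceable object.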

> - the code creates Documents as return values. The reason we went away 
> from this is that we want to use the same document handler with 
> different index options. One of the core issues here is storing the 
> body or not. I don't think there is any true answer for this one, so 
> it should be configurable somehow.

Agreed.  When I went to implement this, it was a toss-up as to who is 
actually in control of the Document instantiation and population.

>  The two options I see are either returning a data object and then 
> turning that into a Document somewhere else or passing some 
> configuration object around. Both are not really nice, the first one 
> needs to create an additional object all the time, while the second 
> one puts quite some burden on the implementer of the document handler. 
> Ideas on that one would be extremely welcome.

If you invert what I have done, then the "controller" needs to know 
more about the fields than you could convey in a String/String Map: is 
a field indexed or not?  Is it tokenized or not?  Is it stored or not?  
Who decides on the field names?  These are all questions we have to 
answer to do this type of thing.
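
As a minimal sketch of the inverted arrangement, assuming the handler returns only name/value data and the controller owns a per-field policy: every name below (FieldPolicy, PlannedField, IndexPlan) is hypothetical, and none of it is Lucene or Ant API.  A real controller would still have to turn each PlannedField into a Lucene field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types exist in Lucene or Ant.

/** Per-field indexing decisions, owned by the controller rather than the handler. */
final class FieldPolicy {
    final boolean stored;
    final boolean indexed;
    final boolean tokenized;

    FieldPolicy(boolean stored, boolean indexed, boolean tokenized) {
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }
}

/** A handler-supplied name/value pair combined with the controller's policy. */
final class PlannedField {
    final String name;
    final String value;
    final FieldPolicy policy;

    PlannedField(String name, String value, FieldPolicy policy) {
        this.name = name;
        this.value = value;
        this.policy = policy;
    }
}

/** Controller-side configuration mapping field names to policies. */
final class IndexPlan {
    private final Map<String, FieldPolicy> policies = new LinkedHashMap<>();
    private final FieldPolicy fallback;

    IndexPlan(FieldPolicy fallback) {
        this.fallback = fallback;
    }

    IndexPlan with(String fieldName, FieldPolicy policy) {
        policies.put(fieldName, policy);
        return this;
    }

    /** Attaches a policy to every extracted value, using the fallback for unknown names. */
    List<PlannedField> apply(Map<String, String> extracted) {
        List<PlannedField> planned = new ArrayList<>();
        for (Map.Entry<String, String> e : extracted.entrySet()) {
            FieldPolicy policy = policies.getOrDefault(e.getKey(), fallback);
            planned.add(new PlannedField(e.getKey(), e.getValue(), policy));
        }
        return planned;
    }
}
```

With something like this, the same handler output could be indexed with the body stored in one plan and not stored in another, at the cost of the extra intermediate objects mentioned above.  It does not settle who chooses the field names; here the handler still does.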

> Two ideas we will probably pick up from this are:
> - use Ant for creating indexes if we go larger than personal document 
> retrieval

Keep in mind you could also launch Ant via its API from a GUI, or just 
use the IndexTask directly and call it through its execute() method.

> - use JTidy for HTML parsing (we missed that one and used Swing 
> instead, which is no good)

I think there are probably some better options out there than using 
JTidy these days, but I have not had time to investigate them.  JTidy 
does the job reasonably well though.

> So thanks again, that was quite helpful.

My pleasure!
