lucene-dev mailing list archives

From Erik Hatcher
Subject Re: Considering lucene
Date Fri, 30 Sep 2005 20:30:33 GMT

On Sep 30, 2005, at 1:26 AM, Paul Smith wrote:

> This requirement is almost exactly the same as my requirement for  
> the log4j project I work on where I wanted to be able to index  
> every row in a text log file to be its own Document.
> It works fine, but treating each line as a Document turns out to  
> take a while to index (searching is fantastic though I have to say)  
> due to the cost of adding a Document to an index.  I don't think  
> Lucene is currently tuned (or tunable) to that level of Document  
> granularity, so it'll depend on your requirement of timeliness of  
> the indexing.

There are several tunable indexing parameters that can help with
batch indexing.  By default Lucene is tuned mostly for incremental
indexing, but for rapid batch indexing you may need to tune it to
merge segments less often.
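
To make "merge less often" concrete, here is a rough sketch. It is not
Lucene code, just a toy model I am assuming for illustration: documents
are buffered in memory, a segment is flushed every bufferedDocs
documents, and whenever mergeFactor segments of the same level pile up
they are merged into one segment at the next level. MergeModel and
countMerges are hypothetical names, and the real merge logic in
IndexWriter is more involved. The two parameters correspond only
loosely to IndexWriter's mergeFactor and minMergeDocs settings.

```java
public class MergeModel {

    /**
     * Counts merge operations in a simplified, level-based merge model.
     * This is an illustrative assumption, not Lucene's actual algorithm.
     */
    static long countMerges(int numDocs, int bufferedDocs, int mergeFactor) {
        if (bufferedDocs < 1 || mergeFactor < 2) {
            throw new IllegalArgumentException("bufferedDocs >= 1 and mergeFactor >= 2 required");
        }
        // segmentsAtLevel[k] = number of unmerged segments at level k
        int[] segmentsAtLevel = new int[64];
        long merges = 0;
        int flushes = numDocs / bufferedDocs; // leftover buffered docs are ignored here
        for (int f = 0; f < flushes; f++) {
            int level = 0;
            segmentsAtLevel[0]++; // a new segment is flushed from memory
            // Cascade: merge whenever a level holds mergeFactor segments.
            while (segmentsAtLevel[level] == mergeFactor) {
                segmentsAtLevel[level] = 0;
                segmentsAtLevel[level + 1]++;
                merges++;
                level++;
            }
        }
        return merges;
    }

    public static void main(String[] args) {
        int docs = 1_000_000;
        System.out.println("small buffer, low merge factor:   " + countMerges(docs, 10, 10));
        System.out.println("large buffer, high merge factor:  " + countMerges(docs, 1000, 50));
    }
}
```

In this model, one million documents with both values at 10 trigger
11111 merges, while buffering 1000 documents with a merge factor of 50
triggers 20.  The price is more memory for buffered documents and more
segment files open while indexing.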

> I was hoping (of course it's a big ask) to be able to index a  
> million rows of relatively short lines of text (as log files tend  
> to be) in a "few moments", no more than 1 minute, but even with
> pretty grunty hardware you run up against the bottleneck of the  
> tokenization process (the StandardAnalyzer is not optimal at all in  
> this case because of the way it 'signals' EOF with an exception).

Signals EOF with an exception?  I'm not following that.  Where does  
that occur?

> There was someone (apologies, I've forgotten his name, I blame the
> holiday I just came back from) that could treat a relatively small  
> file, such as an XML file, and very quickly index that for on the  
> fly XPath like queries using Lucene which apparently works very  
> well, but I'm not sure it scales to massive documents such as log  
> files (and your requirements).

Wolfgang Hoschek and the NUX project may be what you're referring
to.  He contributed the MemoryIndex feature found under
contrib/memory.  I'm not sure that feature is a good fit for log files
or for indexing files line by line, though.

