lucene-dev mailing list archives

From Paul Smith <>
Subject Re: Considering lucene
Date Fri, 30 Sep 2005 05:26:34 GMT
This requirement is almost exactly the same as mine for the log4j
project I work on, where I wanted to index every row of a text log
file as its own Document.

It works fine, but treating each line as a Document turns out to take
a while to index (searching is fantastic, though, I have to say) due
to the cost of adding a Document to an index.  I don't think Lucene is
currently tuned (or tunable) to that level of Document granularity,
so it will depend on how timely you need the indexing to be.
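Independent of Lucene's API, the model we're both describing can be sketched in plain Java: treat each line as its own searchable unit keyed by its row number, and map keywords to the row numbers that contain them. (This is only an illustration of the idea; the class and method names are made up, and a real implementation would go through Lucene's IndexWriter.)

```java
import java.util.*;

// Minimal sketch of the row-as-Document idea: each line of a log
// file becomes its own searchable unit, identified by row number.
// (Illustrative only; a real index would use Lucene's IndexWriter.)
public class RowIndex {

    // keyword -> sorted set of row numbers containing it
    private final Map<String, SortedSet<Integer>> index =
        new HashMap<String, SortedSet<Integer>>();

    public void addRow(int rowNumber, String line) {
        // crude whitespace tokenization, lower-cased
        for (String token : line.toLowerCase().split("\\s+")) {
            if (token.length() == 0) continue;
            SortedSet<Integer> rows = index.get(token);
            if (rows == null) {
                rows = new TreeSet<Integer>();
                index.put(token, rows);
            }
            rows.add(rowNumber);
        }
    }

    public SortedSet<Integer> search(String keyword) {
        SortedSet<Integer> rows = index.get(keyword.toLowerCase());
        return rows != null ? rows : new TreeSet<Integer>();
    }

    public static void main(String[] args) {
        RowIndex idx = new RowIndex();
        String[] lines = {
            "2005-09-30 ERROR connection refused",
            "2005-09-30 INFO startup complete",
            "2005-09-30 ERROR disk full"
        };
        for (int i = 0; i < lines.length; i++) {
            idx.addRow(i + 1, lines[i]);
        }
        System.out.println(idx.search("error")); // rows 1 and 3
    }
}
```

The key design point, and the one that maps onto Lucene, is that the row number is the stored identifier: the search returns row numbers, never the line text itself.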

I was hoping (of course it's a big ask) to be able to index a million
rows of relatively short lines of text (as log files tend to be) in a
'few moments', no more than a minute, but even with pretty grunty
hardware you run up against the bottleneck of the tokenization
process (the StandardAnalyzer is not optimal at all in this case
because of the way it 'signals' EOF with an exception).
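With lines this short, the fixed per-line overhead dominates. A hand-rolled character scan (a sketch only, not a drop-in replacement for a Lucene Analyzer) avoids allocating a Reader and throwing an exception per line:

```java
import java.util.*;

// Sketch of a cheap per-line tokenizer: a single pass over the
// characters, no Reader allocation, no exception to signal the
// end of input. (Illustrative only; not Lucene's Analyzer API.)
public class FastLineTokenizer {

    public static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<String>();
        int start = -1;
        for (int i = 0; i < line.length(); i++) {
            boolean inToken = Character.isLetterOrDigit(line.charAt(i));
            if (inToken && start < 0) {
                start = i;                                   // token begins
            } else if (!inToken && start >= 0) {
                tokens.add(line.substring(start, i).toLowerCase());
                start = -1;                                  // token ends
            }
        }
        if (start >= 0) {
            tokens.add(line.substring(start).toLowerCase()); // trailing token
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("2005-09-30 ERROR: connection refused"));
    }
}
```

Whether a simpler tokenizer actually closes the gap would need measuring; the point is only that the per-line cost, not the per-character cost, is where the minute evaporates.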

There was someone (apologies, I've forgotten his name; I blame the
holiday I just came back from) who could take a relatively small
file, such as an XML file, and very quickly index it for on-the-fly
XPath-like queries using Lucene, which apparently works very well, but
I'm not sure it scales to massive documents such as log files (and
your requirements).


Paul Smith

On 30/09/2005, at 3:17 PM, <> <> wrote:

> Hi,
> My name is Palmik Bijani and I have recently started a new software
> company in India. After initial research, Lucene has surfaced as a
> leading contender for our needs. We have also purchased the Lucene
> book, which we are expecting in a couple of weeks. However, I was
> hoping to get an answer to the following, as we have been unable to
> find this information in everything we have read so far on Lucene.
> We don't know if the book covers this requirement of ours.
> Our requirement is row-based keyword search in a single very large
> text file which can potentially hold millions of rows (with
> delimited fields per row). In other words, we would like Lucene to
> filter and return only the row numbers within a file for the
> respective rows that hold the keywords we query for a particular
> field in each row.
> From everything we have seen so far, Lucene can handle a large set
> of files, tokenizes the keywords within each file, and returns the
> matching file name per keyword, but I have not seen anything about
> segmenting and searching by rows.
> From Lucene's context, one can think of each row as a separate
> file, the field data within each row as document content, and each
> row number as the unique file name.
> From what I have read, Lookoutsoft used Lucene for Outlook email
> searches, and it seems to me that our use case should be possible,
> as fundamentally even email searching is row based.
> Is our requirement something that Lucene can inherently handle
> well, or would it require extensive tweaking and code changes on
> our end?
> Your response is greatly appreciated.
> Thank you,
> Palmik
