lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <>
Subject Re: Indexing clarification , please advice
Date Wed, 13 Dec 2006 13:41:58 GMT
Let me take a crack at it. See below...

On 12/13/06, abdul aleem <> wrote:
> Hello All,
> Apolgies if it is a naive question
> a) Indexing large file ( more than 4MB )
>    Do i need to read the entire file as string using
> and create a Document object ?

Essentially yes. IF you must index the whole document as a single Lucene
document. my reply later for why you may not want to do this.

The following are equivalent.
Document doc = new Document();
doc.add("textfield", "data1 data2 data3");
doc.add("textfield", "data4 data5 data6");

Document doc = new Document();
doc.add("textfield", "data1 data2 data3 data4 data5 data6");

So you could read chunks of your file and index them into the *same* field
before writing the document to the index, or you could read the file as a
single chunk and index it all at once.

HOWEVER: note that the default in Lucene is to index only the first 10,000
(?) tokens for a field in a single document. See

   The file contains timestamp, if i need to index on
>    timestamp is parsing the entire file manually
> (tokenizing) and store the timestamp as document
> object is the only way ? or is there any alternatives
> ?

Perhaps I don't understand the problem. You can store the timestamp as a
*field* in a document. Is that what you mean by storing it as a "document
object"? But there's no way I know of to have Lucene automatically do
something with arbitrary text in the input stream. You could write a custom
analyzer (see Lucene In Action for the Synonym Analyzer for a model). That
Analyzer would be responsible for recognizing timestamps in the input stream
and doing something special with them.

b) I need to search the contents of a log file which
> is changing rapidly, from initial testing i see if any
> changes in the file is not reflected unless it is
> *Indexed* again
>    Do we need to index the files always before search
> if the content of the file is dynamically changed

Yes. That is, you can't search data you haven't indexed.

  ( log file has a pattern and always logs in a
> similar fashion, each time i need to index takes lot
> of time as the file is large (approach a )  is there
> any work arounds for this ? )

I think you need to re-think your approach. A document in Lucene is whatever
you want to think of it as. For instance, you could index your log file such
that each Lucene "document" was all the data added to the log file over some
specified time interval. So, say you have a log that starts at midnight.
Each Lucene "document" could be all the data added to the index for each
minute. So you'd have a document for the data added to the log between 12:00
and 12:00:59. Another "document" for all data between 12:01:00 and 12:01:59
etc. That way, you don't have to re-index the entire log, just everything
since the last interval you already indexed.

I'm not necessarily recommending this approach, but using it to illustrate
that you don't need to think of a Lucene Document as your entire log file.
You may be much better off slicing the data up somehow and having a
one-to-many relationship between your log and the Lucene documents......

Perhaps you could index every message as an individual document (which would
deal with your timestamp issue), or.....

Hope this helps

  I would greatly appreciate if any inputs on the
> above,
> Many thanks,
> Abdul
> ____________________________________________________________________________________
> Need a quick answer? Get one in minutes from people who know.
> Ask your question on
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message