lucene-java-user mailing list archives

From "Nader S. Henein" <>
Subject RE: Creating indexes
Date Wed, 19 Jun 2002 06:54:43 GMT
Just store the whole thing in the index. It'll make the index bigger,
but it'll allow you to find method in the madness. Manually parsing a forty
meg file every time you need to display search results is too intensive.

Nader Henein

-----Original Message-----
From: Chris Sibert []
Sent: Wednesday, June 19, 2002 10:47 AM
To: Lucene Users List
Subject: Re: Creating indexes

The file that I have is big, about 40 MB. And it's got a whole lot of
smaller documents in it - about 15 thousand - too many to separate into
individual files. These individual documents are actually similar to emails
stored in a large text file. The file is structured to an extent, with a
number before each document - (ex: __10001__, __10002__, etc.), with the
date, etc. Kind of like email headers.

In the Lucene index, it seems like I'll have to:  1) use a DocumentNumbers
field to index all of the document numbers, 2) a Dates field to index the
document dates, 3) and a TextBody field to index all of the document text
together. I'll have to write an InputStreamFilter or something to parse the
data as it's coming in to the lucene IndexWriter, create a new document
every time I hit a new number, and parse out the numbers - like __10001__ -
so I can separate them out in the DocumentNumbers field, the dates into a
Dates field, and the text in a TextBody field. It won't be pleasant writing
that parser, but...
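
A minimal sketch of the splitting step described above, in plain Java with no Lucene calls. It assumes each marker such as __10001__ sits alone on its own line. SubDocument and RecordSplitter are hypothetical names used only for illustration, and date parsing is left out because the date format isn't given in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** One sub-document cut out of the big file (hypothetical type, for illustration). */
final class SubDocument {
    final String number; // the digits between the underscores, e.g. "10001"
    final String body;   // everything up to the next marker, trimmed

    SubDocument(String number, String body) {
        this.number = number;
        this.body = body;
    }
}

final class RecordSplitter {
    // Assumption: a marker like __10001__ occupies a whole line by itself.
    private static final Pattern MARKER = Pattern.compile("(?m)^__(\\d+)__$");

    /** Splits the text into sub-documents, one per marker line. */
    static List<SubDocument> split(String text) {
        List<SubDocument> result = new ArrayList<>();
        Matcher m = MARKER.matcher(text);
        String number = null;
        int bodyStart = 0;
        while (m.find()) {
            if (number != null) {
                // Close the previous sub-document at the start of this marker.
                result.add(new SubDocument(number, text.substring(bodyStart, m.start()).trim()));
            }
            number = m.group(1);
            bodyStart = m.end();
        }
        if (number != null) {
            // The last sub-document runs to the end of the text.
            result.add(new SubDocument(number, text.substring(bodyStart).trim()));
        }
        return result;
    }
}
```

Each SubDocument returned here would then become one Lucene document, with the number, the date, and the text going into their own fields. Text before the first marker is ignored in this sketch.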

My other issue at this point is how to then display the documents that
relate to the search hits. I have to be able to open that 40 MB file and go
to the document(s) that correspond to the hits in the index, for display to
the user. Does Lucene keep a location stored in the index of where each word
is found in the original file ? How do I know at what point in the original
data file to find the offset to display the original document ? Is this
something that I have to store myself in each document object in the index ?
Is this why you create separate document objects in the Lucene index ? -
Each new document object in the index will contain the file offset to the
original data file ? And if Lucene doesn't put that file offset in there
automagically, I would have to store that myself as I create the index, in
something like a FileOffsetLocation field, for each document. Am I on the
right track here ?
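
If the offset route is taken, the lookup side could look roughly like the sketch below. It is plain Java, not a Lucene API: it assumes the byte offset and byte length of each sub-document were recorded by the parser at indexing time and kept with the index entry (for example in the FileOffsetLocation field mentioned above). OffsetReader and readSlice are hypothetical names, and the single-byte encoding is also an assumption.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

final class OffsetReader {
    /**
     * Reads `length` bytes starting at byte `offset` of the file at `path`
     * and returns them as text, without scanning the rest of the file.
     */
    static String readSlice(String path, long offset, int length) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            byte[] buffer = new byte[length];
            file.seek(offset);      // jump straight to the sub-document
            file.readFully(buffer); // throws EOFException if the file is too short
            // Assumption: the data file uses a one-byte-per-character encoding.
            return new String(buffer, StandardCharsets.ISO_8859_1);
        }
    }
}
```

The offsets have to be byte positions in the same file that is opened at display time. If the file is edited or re-encoded after indexing, the stored offsets no longer line up and the index would need rebuilding.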


----- Original Message -----
From: "none none" <>
To: "Lucene Users List" <>
Sent: Wednesday, June 12, 2002 11:56 AM
Subject: Re: Creating indexes

> Lucene doesn't know where a file starts or ends. Actually it does know, but in
your case one Document contains many smaller documents. If you want to split
your big file into small files, you must do that yourself. Take a look at the
Document class and you will see that Lucene uses a Reader to index the body
of a file, so maybe you should build a class that returns a Reader for each
sub-document you want.
> But I think it is easier to split your main document into small documents and
index these small documents with a common "keyword" that is the actual big file
name, so that when you search you can tell where each "sub" document is
located. After you index those files you can delete them. What you need is
a BigDocumentManager that:
> 1. splits your big file(s)
> 2. indexes them (don't forget the keyword => big doc name)
> 3. deletes those "sub" documents (they are like temp docs).
> Hope this helps.
> --
> On Wed, 12 Jun 2002 02:26:58
>  Chris Sibert wrote:
> >I have a big (40 MB or so) file to index. The file contains a whole lot
> >of documents, which are each pretty small, about a few typewritten pages
> >long. There's a title, date, and author for each document, in addition to
> >the documents' actual text.
> >
> >I'm not quite sure how you index this in Lucene. For each document in the
> >original file, I assume that I create a separate Lucene Document object in
> >the index with author, date, title, and text fields. If so, my question is
> >that when I'm reading in the original file for indexing, does Lucene know
> >where each document begins and ends in the original file ? Or do I have to
> >write a parser or filter or something for the InputStream that's reading the
> >file ?
> >
> >Chris Sibert
> >
> >
> >
> >--
> >To unsubscribe, e-mail:
> >For additional commands, e-mail:
> >
> >
