lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "none none" <>
Subject Re: Creating indexes
Date Wed, 12 Jun 2002 15:56:52 GMT
 Lucene doesn't know where a file start or ends, actually it knows, but in your case 1 Docuemtn
contains more small documents.If you want to split your big file in small files you must to
that by yourself, Take a look at the Document class and you will see that Lucene use a Reader
to index the body of a file, so may be you should build a class that return a Reader for each
sub-document you want.
But i think is easier split your main document in small document, index this small documents
with a common "keyword" that is the actual Big file name, so when you'll search you can understand
where this "sub" document is allocated. After you index those files you can delete them. What
you need is a BigDocumentManager that:

1.split your big file/s
2.index them. (don't forget the keyword => big doc name)
3.delete those "sub" documents (are like temp docs).

Hope this helps.


On Wed, 12 Jun 2002 02:26:58  
 Chris Sibert wrote:
>I have a big ( 40 MB or so) file to index. The file contains a whole bunch
>of documents, which are each pretty small, about a few typewritten pages
>long. There's a title, date, and author for each document, in addition to
>the documents' actual text.
>I'm not quite sure how you index this in Lucene. For each document in the
>original file, I assume that I create a separate Lucene Document object in
>the index with author, date, title, and text fields. If so, my question is
>that when I'm reading in the original file for indexing, does Lucene know
>where each document begins and ends in the original file ? Or do I have to
>write a parser or filter or something for the InputStream that's reading the
>file ?
>Chris Sibert
>To unsubscribe, e-mail:   <>
>For additional commands, e-mail: <>

WIN a first class trip to Hawaii.  Live like the King of Rock and Roll
on the big Island. Enter Now!

To unsubscribe, e-mail:   <>
For additional commands, e-mail: <>

View raw message