lucene-java-user mailing list archives

From "none none" <>
Subject RE: Creating indexes
Date Wed, 19 Jun 2002 13:49:14 GMT
OK, let's reorganize:

1. You know how to split the file.
2. If your file is about 40 MB and you do not have a large number of them to index, you can
"store" each "sub_document" in the index. Take a look at the Document class: the body field
should be indexed, tokenized, and stored.
3. A stored field can be loaded just by retrieving the Document from the Hits and getting the field.
4. If you don't like point 3, you can also store your "sub_documents" in the file system.
I suggest this: when you parse the big file, split it into small files and save
them as [keyword].txt in a common folder named [big file name], e.g.:
/[big file name]/__10001__.txt
5. Run the indexer on those files.
6. Add a keyword to recognize the "big file name": add the folder name as a keyword field (indexed,
stored, not tokenized).
7. Run your query as you want. If you need to search within one particular big file only,
run a query that also restricts the folder keyword to the file you want.
8. If you want to highlight your document, search this mailing list for "highlight" and you'll
find something.
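
A minimal sketch of the split described in point 4, in plain Java with no Lucene calls. It assumes each sub-document starts with a line that contains only its marker (for example __10001__); the real file layout may differ. The class name BigFileSplitter and the method split are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: splits one big file into <folder>/<marker>.txt files.
class BigFileSplitter {
    // Assumption: a marker line looks like __10001__ and holds nothing else.
    private static final Pattern MARKER = Pattern.compile("__\\d+__");

    /** Writes each sub-document to its own file and returns the files written, in order. */
    static List<Path> split(Reader bigFile, Path folder) {
        List<Path> written = new ArrayList<>();
        BufferedWriter out = null;
        try {
            Files.createDirectories(folder);
            BufferedReader in = new BufferedReader(bigFile);
            String line;
            while ((line = in.readLine()) != null) {
                String trimmed = line.trim();
                if (MARKER.matcher(trimmed).matches()) {
                    // A new marker closes the previous file and opens the next one.
                    if (out != null) {
                        out.close();
                    }
                    Path target = folder.resolve(trimmed + ".txt");
                    out = Files.newBufferedWriter(target, StandardCharsets.UTF_8);
                    written.add(target);
                } else if (out != null) {
                    // Body line of the current sub-document.
                    out.write(line);
                    out.newLine();
                }
                // Lines before the first marker are skipped.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (out != null) {
                try {
                    out.close();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        }
        return written;
    }
}
```

The folder passed in plays the role of [big file name], so its name is what you would later add to each indexed document as the keyword.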

Is that ok?


On Wed, 19 Jun 2002 10:54:43  
 Nader S. Henein wrote:
>Just store the whole thing in the index. It'll make the index bigger,
>but it'll let you find method in the madness; manually parsing a forty
>meg file every time you need to display search results is too intensive.
>Nader Henein
>-----Original Message-----
>From: Chris Sibert []
>Sent: Wednesday, June 19, 2002 10:47 AM
>To: Lucene Users List
>Subject: Re: Creating indexes
>The file that I have is big, about 40 MB. And it's got a whole lot of
>smaller documents in it - about 15 thousand - too many to separate into
>individual files. These individual documents are actually similar to emails
>stored in a large text file. The file is structured to an extent, with a
>number before each document - (ex: __10001__, __10002__, etc.), with the
>date, etc. Kind of like email headers.
>In the Lucene index, it seems like I'll have to:  1) use a DocumentNumbers
>field to index all of the document numbers, 2) a Dates field to index the
>document dates, 3) and a TextBody field to index all of the document text
>together. I'll have to write an InputStreamFilter or something to parse the
>data as it's coming in to the lucene IndexWriter, create a new document
>every time I hit a new number, and parse out the numbers - like __10001__ -
>so I can separate them out in the DocumentNumbers field, the dates into a
>Dates field, and the text in a TextBody field. It won't be pleasant writing
>that parser, but...
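
A rough sketch of the parser described above, in plain Java without any Lucene classes. It only separates the document number from the body text; the date is left inside the body because the thread does not say how dates are written in the file. It assumes each record starts with a line holding only the marker (for example __10001__). SubDocumentParser and SubDocument are illustrative names, not part of Lucene.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser: turns the big file into (number, body) records.
class SubDocumentParser {
    // Assumption: a record begins with a line such as __10001__ and nothing else on it.
    private static final Pattern MARKER = Pattern.compile("__(\\d+)__\\s*");

    /** One record: the value for a number field and the value for a text body field. */
    static final class SubDocument {
        final String number;
        final String body;

        SubDocument(String number, String body) {
            this.number = number;
            this.body = body;
        }
    }

    static List<SubDocument> parse(Reader in) {
        List<SubDocument> docs = new ArrayList<>();
        String currentNumber = null;
        StringBuilder body = new StringBuilder();
        try {
            BufferedReader reader = new BufferedReader(in);
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = MARKER.matcher(line);
                if (m.matches()) {
                    // A new number ends the previous record, if there was one.
                    if (currentNumber != null) {
                        docs.add(new SubDocument(currentNumber, body.toString()));
                    }
                    currentNumber = m.group(1);
                    body.setLength(0);
                } else if (currentNumber != null) {
                    body.append(line).append('\n');
                }
                // Text before the first marker is ignored.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (currentNumber != null) {
            docs.add(new SubDocument(currentNumber, body.toString()));
        }
        return docs;
    }
}
```

Each SubDocument would then become one Lucene Document, with number and body added as separate fields.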
>My other issue at this point is how to then display the documents that
>relate to the search hits. I have to be able to open that 40 MB file and go
>to the document(s) that correspond to the hits in the index, for display to
>the user. Does Lucene keep a location stored in the index of where each word
>is found in the original file ? How do I know at what point in the original
>data file to find the offset to display the original document ? Is this
>something that I have to store myself in each document object in the index ?
>Is this why you create separate document objects in the Lucene index ? -
>Each new document object in the index will contain the file offset to the
>original data file ? And if Lucene doesn't put that file offset in there
>automagically, I would have to store that myself as I create the index, in
>something like a FileOffsetLocation field, for each document. Am I on the
>right track here ?
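
If the offsets are to be recorded by hand, as the message above proposes, one possible approach is sketched here in plain Java. It scans the big file once, records the byte range of each sub-document, and later reads one range back with RandomAccessFile. It assumes marker lines such as __10001__ on their own line and a single-byte text encoding (ISO-8859-1). OffsetIndex, Range, scan and read are illustrative names; in a Lucene index the start and length would presumably go into stored fields of each Document.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: maps each marker to the byte range of its body.
class OffsetIndex {
    private static final Pattern MARKER = Pattern.compile("__\\d+__");

    /** Byte range of one sub-document body inside the big file. */
    static final class Range {
        final long start;  // offset of the first byte after the marker line
        final int length;  // number of bytes up to the next marker (or end of file)

        Range(long start, int length) {
            this.start = start;
            this.length = length;
        }
    }

    static Map<String, Range> scan(Path bigFile) {
        byte[] data;
        try {
            data = Files.readAllBytes(bigFile);  // acceptable for a file of about 40 MB
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, Range> ranges = new LinkedHashMap<>();
        String current = null;
        int bodyStart = 0;
        int lineStart = 0;
        while (lineStart < data.length) {
            int lineEnd = lineStart;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            int next = Math.min(lineEnd + 1, data.length);  // start of the following line
            String line = new String(data, lineStart, lineEnd - lineStart,
                    StandardCharsets.ISO_8859_1).trim();
            if (MARKER.matcher(line).matches()) {
                if (current != null) {
                    ranges.put(current, new Range(bodyStart, lineStart - bodyStart));
                }
                current = line;
                bodyStart = next;
            }
            lineStart = next;
        }
        if (current != null) {
            ranges.put(current, new Range(bodyStart, data.length - bodyStart));
        }
        return ranges;
    }

    /** Seeks straight to one sub-document instead of re-parsing the whole file. */
    static String read(Path bigFile, Range range) {
        byte[] buf = new byte[range.length];
        try (RandomAccessFile raf = new RandomAccessFile(bigFile.toFile(), "r")) {
            raf.seek(range.start);
            raf.readFully(buf);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buf, StandardCharsets.ISO_8859_1);
    }
}
```

With the range stored next to each hit, displaying a result costs one seek and one short read rather than a pass over the whole file.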
>----- Original Message -----
>From: "none none" <>
>To: "Lucene Users List" <>
>Sent: Wednesday, June 12, 2002 11:56 AM
>Subject: Re: Creating indexes
>> Lucene doesn't know where a file starts or ends. Actually it does, but in
>> your case one Document contains many small documents. If you want to split your
>> big file into small files you must do that yourself. Take a look at the
>> Document class and you will see that Lucene uses a Reader to index the body
>> of a file, so maybe you should build a class that returns a Reader for each
>> sub-document you want.
>> But I think it is easier to split your main document into small documents and index
>> these small documents with a common "keyword" that is the actual big file
>> name, so when you search you can tell where each "sub" document is
>> located. After you index those files you can delete them. What you need is
>> a BigDocumentManager that:
>> 1.split your big file/s
>> 2.index them. (don't forget the keyword => big doc name)
>> 3.delete those "sub" documents (are like temp docs).
>> Hope this helps.
>> --
>> On Wed, 12 Jun 2002 02:26:58
>>  Chris Sibert wrote:
>> >I have a big (40 MB or so) file to index. The file contains a whole lot
>> >of documents, which are each pretty small, about a few typewritten pages
>> >long. There's a title, date, and author for each document, in addition to
>> >the documents' actual text.
>> >
>> >I'm not quite sure how you index this in Lucene. For each document in the
>> >original file, I assume that I create a separate Lucene Document object in
>> >the index with author, date, title, and text fields. If so, my question is
>> >that when I'm reading in the original file for indexing, does Lucene know
>> >where each document begins and ends in the original file ? Or do I have to
>> >write a parser or filter or something for the InputStream that's reading the
>> >file ?
>> >
>> >Chris Sibert
>> >
>> >
>> >
>> >--
>> >To unsubscribe, e-mail:
>> >For additional commands, e-mail:
>> >
>> >
