lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Merging Lucene documents
Date Sun, 06 Jan 2008 18:13:58 GMT
I don't get what you mean about extracting token streams. A TokenStream
is, as far as I understand, an analysis-time class; it comes into play
either when originally indexing a document or when analyzing a query.
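To make the distinction concrete, here's roughly what working with a
token stream looks like (a quick sketch against the Lucene 2.x API;
the analyzer, field name, and text are just placeholders):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class TokenStreamDemo {
    public static void main(String[] args) throws Exception {
        // A TokenStream is produced by an analyzer from raw text; it is
        // not something you can pull back out of an existing index.
        TokenStream ts = new StandardAnalyzer()
            .tokenStream("content", new StringReader("some raw text"));
        Token token;
        while ((token = ts.next()) != null) {
            System.out.println(token.termText() + " @ "
                + token.startOffset());
        }
    }
}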

If you do not have the entire document stored in the index, you have to
reconstruct the document from the indexed data, which is lossy and
time-consuming. But see the Luke code for a way to do this.
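The gist of the Luke approach is to walk every term in a field and ask
for its positions in the target document. Something like this (a rough
sketch against the Lucene 2.x API, assuming a single positional field
and a known doc id; Luke's real code handles more cases):

import java.util.Map;
import java.util.TreeMap;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermPositions;
import org.apache.lucene.store.Directory;

public class Reconstruct {
    public static String reconstruct(Directory dir, int docId,
                                     String field) throws Exception {
        IndexReader reader = IndexReader.open(dir);
        // Collect every term of this field that occurs in the target
        // document, keyed by position, then read them back in order.
        Map<Integer, String> terms = new TreeMap<Integer, String>();
        TermEnum te = reader.terms(new Term(field, ""));
        do {
            Term t = te.term();
            if (t == null || !t.field().equals(field)) break;
            TermPositions tp = reader.termPositions(t);
            if (tp.skipTo(docId) && tp.doc() == docId) {
                int freq = tp.freq();
                for (int i = 0; i < freq; i++) {
                    terms.put(tp.nextPosition(), t.text());
                }
            }
            tp.close();
        } while (te.next());
        te.close();
        reader.close();

        StringBuilder sb = new StringBuilder();
        for (String text : terms.values()) {
            sb.append(text).append(' ');
        }
        return sb.toString();
    }
}

Note that anything the analyzer threw away (stop words, punctuation,
original case) is gone for good, which is why this only approximates
the original text.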

If you *do* have stored fields, then you have the raw text available.

In either case, you eventually get a string representation of the various
fields in the documents you want to combine. Why not just index that?
Since this is an indexing process and can (presumably) take some time,
you could either concatenate the strings in memory and index the result,
or write them to a file on disk and then index *that*.
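The in-memory variant might look something like this (again just a
sketch against Lucene 2.x; the field name and doc ids are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class MergeDocs {
    public static void merge(Directory dir, int[] docIds,
                             String field) throws Exception {
        // Pull the stored text of each source document together.
        IndexReader reader = IndexReader.open(dir);
        StringBuilder sb = new StringBuilder();
        for (int id : docIds) {
            sb.append(reader.document(id).get(field)).append(' ');
        }
        reader.close();

        // Index the combined text as one new document.
        Document merged = new Document();
        merged.add(new Field(field, sb.toString(),
            Field.Store.YES, Field.Index.TOKENIZED));
        IndexWriter writer =
            new IndexWriter(dir, new StandardAnalyzer(), false);
        writer.addDocument(merged);
        writer.close();
    }
}

Because the combined string goes through the analyzer like any other
document, the token offsets and positions are computed fresh, so your
questions 2 and 3 don't arise.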

If this is way off base, perhaps a bit more explanation of the problem
you're trying to solve would be in order.

Best
Erick

On Jan 6, 2008 12:45 PM, Developer Developer <devquestions@gmail.com> wrote:

> Hello Friends,
>
> I have a unique requirement of merging two or more Lucene indexed
> documents into just one indexed document. For example:
>
> Document newDocument = doc1 + doc2 + doc3
>
> In order to do this, I am planning to extract token streams from each
> document (i.e. doc1, doc2, and doc3) and use them to construct
> newDocument. The reason is, I do not have access to the content of
> the original documents (doc1, doc2, doc3).
>
>
> My questions are:
>
> 1. Is this the correct approach?
> 2. Do I have to update the start and end offsets of the tokens, since
> the tokens from the original documents (doc1, 2, 3) were relative to
> those documents, and in newDocument these offsets may be wrong?
> 3. If yes, then how do I make sure that the merged tokens have
> correct start and end offsets?
>
> Thanks !
>
