lucene-java-user mailing list archives

From "Developer Developer" <devquesti...@gmail.com>
Subject Re: Merging Lucene documents
Date Sun, 06 Jan 2008 18:54:14 GMT
Hi Erick,

No, you are not off base. You are on track, but here is my problem.

I have a requirement to create one Lucene document per site, i.e. suppose I
crawl www.xxx.com, which has 1000 pages. If I use Nutch, it will create 1000
Lucene documents, i.e. one document per page. My requirement is to combine
all 1000 pages into just one Lucene document.

One approach is to construct an in-memory String by combining the content
from all the pages and then index it in Lucene as one document, but this is
not an elegant approach because the in-memory String would be a memory hog.
Therefore I am trying to construct a TokenStream for each document, as
follows:

    StandardAnalyzer st = new StandardAnalyzer();
    TokenStream stream = st.tokenStream("content", new StringReader(documentText));

and then construct a Lucene Document using a Field built from that TokenStream:

    Field(String name, TokenStream tokenStream)
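
If I understand that constructor correctly, building the site-level document
would then look roughly like this (untested; siteTokenStream and writer are
placeholders for my combined stream and an existing IndexWriter):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // One Lucene document for the whole site; the "content" field is fed from
    // a TokenStream instead of a String, so the full site text never has to be
    // held in memory at once. A Field constructed from a TokenStream is
    // indexed but not stored.
    Document siteDoc = new Document();
    siteDoc.add(new Field("content", siteTokenStream));
    writer.addDocument(siteDoc);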

The TokenStream would be my own implementation, which overrides the next()
method and returns tokens one by one. With this approach I can avoid creating
a huge in-memory String.

So I am wondering: will the tokens have correct offset values with this
approach?
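
To make this concrete, here is a rough, untested sketch of what I have in
mind (assuming the API where TokenStream.next() returns a Token; the class
name, the page iterator and the "+1" separator adjustment are just my own
placeholders and guesses):

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.Iterator;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    /**
     * Untested sketch: tokenizes one page at a time and shifts each token's
     * offsets by the number of characters in the pages already consumed, so
     * the offsets look as if all pages had been concatenated into one text.
     */
    public class SiteTokenStream extends TokenStream {
        private final Analyzer analyzer;
        private final Iterator pages;      // iterates over per-page Strings
        private TokenStream current;       // stream for the page being tokenized
        private int offsetBase = 0;        // chars consumed from earlier pages
        private int currentPageLength = 0;

        public SiteTokenStream(Analyzer analyzer, Iterator pages) {
            this.analyzer = analyzer;
            this.pages = pages;
        }

        public Token next() throws IOException {
            while (true) {
                if (current == null) {
                    if (!pages.hasNext()) {
                        return null;       // no more pages: end of stream
                    }
                    String pageText = (String) pages.next();
                    currentPageLength = pageText.length();
                    current = analyzer.tokenStream("content",
                                                   new StringReader(pageText));
                }
                Token t = current.next();
                if (t != null) {
                    // rebuild the token with site-wide offsets instead of
                    // page-local ones
                    Token shifted = new Token(t.termText(),
                                              t.startOffset() + offsetBase,
                                              t.endOffset() + offsetBase,
                                              t.type());
                    shifted.setPositionIncrement(t.getPositionIncrement());
                    return shifted;
                }
                current.close();
                current = null;
                offsetBase += currentPageLength + 1; // as if pages were joined
                                                     // by a single separator
            }
        }

        public void close() throws IOException {
            if (current != null) {
                current.close();
            }
        }
    }

The offsetBase bookkeeping is my guess at how the offsets would have to be
adjusted so they stay consistent across the merged document; that is really
the part I am unsure about.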

Thanks!




On Jan 6, 2008 1:13 PM, Erick Erickson <erickerickson@gmail.com> wrote:

> I don't get what you mean about extracting tokenstreams. Tokenstreams
> are, as far as I understand, an analysis-time class. That is, either when
> originally indexing the document or when analyzing a query.
>
> If you do not have the entire document stored in the index, you have to
> do something like reconstruct the document from the indexed data, which
> is time-consuming. But see the Luke code for a way to do this.
>
> If you *do* have stored fields, then you have the raw text available.
>
> In either case, you eventually get a string representation of the various
> fields in the documents you want to combine. Why not just index that?
> Since this is an index process and (presumably) can take some time,
> you could either concatenate the strings together in memory and index
> the string or write it to a file on disk and then index *that*.
>
> If this is way off base, perhaps a bit more explanation of the problem
> you're trying to solve would be in order.
>
> Best
> Erick
>
> On Jan 6, 2008 12:45 PM, Developer Developer <devquestions@gmail.com>
> wrote:
>
> > Hello Friends,
> >
> > I have a unique requirement of merging two or more Lucene-indexed
> > documents into just one indexed document. For example:
> >
> > Document newDocument = doc1 + doc2 + doc3
> >
> > In order to do this I am planning to extract TokenStreams from each
> > document (i.e. doc1, doc2 and doc3) and use them to construct newDocument.
> > The reason is, I do not have access to the content of the original
> > documents (doc1, doc2, doc3).
> >
> >
> > My questions are:
> >
> > 1. Is this the correct approach?
> > 2. Do I have to update the start and end offsets of the tokens, since the
> > tokens from the original documents (doc1, doc2, doc3) were relative to
> > those documents, and in newDocument these offsets may be wrong?
> > 3. If yes, how do I make sure that the merged tokens have correct start
> > and end offsets?
> >
> > Thanks !
> >
>
