lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Merging Lucene documents
Date Sun, 06 Jan 2008 22:43:56 GMT
Well, I wonder if offsets mean anything in this context. I suppose another
way of asking this is "why do you care?". What is it that you're trying
to make happen? Or prevent from happening?

The offsets are used for a variety of purposes, but what does it mean
for two offsets to be "close" in this context? Imagine that you have
the following pages:
p1
  p1.1
  p1.2
    p1.2.1
    p1.2.2
  p1.3
  p1.4
p2
  p2.1
  p2.2
    p2.2.1
    p2.2.2


Now, is a token on p1.2.1 near or far from a token on p2.2.1? Does
it depend upon how you happened to crawl the site? If so, could you
crawl the site once and get p1 followed directly by p2, and crawl it
again and get p1, p1.1, p1.2, ... and only then p2? What does
adjacency mean then? And since it can change anyway, does it
really provide you with anything useful?

What all this is leading to is that, so far, this seems like an XY problem.
That is, you're trying to solve some (as yet unstated) problem and
assuming that merging token streams is the answer. What is the
higher-level problem you're struggling with? Why couldn't you go ahead
and index each page as a single document, *also* index the base URL,
and then just search with an "AND url:blah.blort.blivet" clause?
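
Something like the following is what I have in mind (just a sketch; the
field names and the indexPage() signature are placeholders, and this uses
the 2.x Field flags):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // one document per crawled page, plus the site's base URL in its own field
    void indexPage(IndexWriter writer, String baseUrl, String pageText)
            throws Exception {
        Document doc = new Document();
        doc.add(new Field("content", pageText,
                Field.Store.NO, Field.Index.TOKENIZED));
        // UN_TOKENIZED keeps the whole URL as a single term, so you can
        // AND "url:blah.blort.blivet" onto any content query
        doc.add(new Field("url", baseUrl,
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.addDocument(doc);
    }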

I think my first approach would be really stupid: just crawl the site,
write the resulting pages out to one file per site on disk, then index
those files. If for no other reason than that re-crawling the website for
every iteration of your index will be an utter pain in the neck, since
you'll be waiting forever to go back to the site each time. *Then* worry
about elegance <G>...
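
For instance (again only a sketch; appendPage()/indexSite() and the file
layout are made up, but Field(String, Reader) really does stream the file,
so the whole site never has to be in memory at once):

    import java.io.*;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // during the crawl: append each page's text to one file per site
    void appendPage(File siteFile, String pageText) throws IOException {
        Writer out = new FileWriter(siteFile, true); // true = append
        out.write(pageText);
        out.write("\n");
        out.close();
    }

    // at index time: one document per site file
    void indexSite(IndexWriter writer, File siteFile) throws IOException {
        Document doc = new Document();
        // Field(String, Reader) is tokenized but not stored; the reader is
        // consumed when the document is indexed
        doc.add(new Field("content", new FileReader(siteFile)));
        writer.addDocument(doc);
    }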

But I will say, about offsets: If you're overriding next(), you can make
them anything you want.
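
For example, something like this (a sketch against the pre-2.3
Token/TokenStream API; OffsetShiftingTokenStream is just a made-up name,
and how you compute the shift is your own bookkeeping):

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    // passes tokens through unchanged except that every offset is moved
    // by a fixed amount
    public class OffsetShiftingTokenStream extends TokenStream {
        private final TokenStream input;
        private final int shift;

        public OffsetShiftingTokenStream(TokenStream input, int shift) {
            this.input = input;
            this.shift = shift;
        }

        public Token next() throws IOException {
            Token t = input.next();
            if (t == null) return null; // underlying stream is exhausted
            Token shifted = new Token(t.termText(),
                    t.startOffset() + shift, t.endOffset() + shift, t.type());
            shifted.setPositionIncrement(t.getPositionIncrement());
            return shifted;
        }

        public void close() throws IOException {
            input.close();
        }
    }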

Best
Erick

On Jan 6, 2008 1:54 PM, Developer Developer <devquestions@gmail.com> wrote:

> Hi Erick,
>
> No, you are not off base. You are on track, but here is my problem.
>
> I have a requirement to create one Lucene document per site. For example,
> suppose I crawl www.xxx.com, which has 1000 pages. If I use Nutch, it will
> create 1000 Lucene documents, i.e. one document per page. My requirement
> is to combine all 1000 pages into just one Lucene document.
>
> One approach is to construct an in-memory String by combining the content
> from all the pages and then index it in Lucene as one document, but this
> is not an elegant approach because the in-memory String would be a memory
> hog. Therefore I am trying to construct a TokenStream for each document as
> follows:
>
>     StandardAnalyzer st = new StandardAnalyzer();
>     TokenStream stream = st.tokenStream("content",
>             new StringReader(documentText));
>
> and then construct a Lucene Document by using the Field constructor that
> takes a TokenStream:
>
>     Field(String name, TokenStream tokenStream)
>
> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/document/Field.html
>
> Then the TokenStream would be my own implementation, which would override
> the next() method and return tokens one by one. With this approach I can
> avoid creating a huge in-memory String.
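>
> Roughly, something like this (just a sketch; ConcatenatingTokenStream is
> the class I would write, and pageTexts/writer are placeholders):
>
>     // the custom stream walks the pages and returns their tokens in turn
>     TokenStream combined = new ConcatenatingTokenStream(pageTexts,
>             new StandardAnalyzer());
>
>     Document siteDoc = new Document();
>     // Field(String name, TokenStream tokenStream): indexed from the
>     // stream, not stored
>     siteDoc.add(new Field("content", combined));
>     writer.addDocument(siteDoc);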
>
> So I am wondering whether the tokens will have correct offset values with
> this approach.
>
> Thanks!
>
>
>
>
> On Jan 6, 2008 1:13 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>
> > I don't get what you mean about extracting token streams. TokenStreams
> > are, as far as I understand, an analysis-time class; that is, they are
> > used either when originally indexing the document or when analyzing a
> > query.
> >
> > If you do not have the entire document stored in the index, you have to
> > do something like reconstruct the document from the indexed data, which
> > is time-consuming. But see the Luke code for a way to do this.
> >
> > If you *do* have stored fields, then you have the raw text available.
> >
> > In either case, you eventually get a string representation of the
> > various fields in the documents you want to combine. Why not just index
> > that? Since this is an indexing process and (presumably) can take some
> > time, you could either concatenate the strings together in memory and
> > index the result, or write it to a file on disk and then index *that*.
> >
> > If this is way off base, perhaps a bit more explanation of the problem
> > you're trying to solve would be in order.
> >
> > Best
> > Erick
> >
> > On Jan 6, 2008 12:45 PM, Developer Developer <devquestions@gmail.com>
> > wrote:
> >
> > > Hello Friends,
> > >
> > > I have a unique requirement of merging two or more Lucene indexed
> > > documents into just one indexed document. For example:
> > >
> > > Document newDocument = doc1 + doc2 + doc3
> > >
> > > In order to do this I am planning to extract token streams from each
> > > document (i.e. doc1, doc2, and doc3) and use them to construct
> > > newDocument. The reason is, I do not have access to the content of the
> > > original documents (doc1, doc2, doc3).
> > >
> > >
> > > My questions are:
> > >
> > > 1. Is this the correct approach?
> > > 2. Do I have to update the start and end offsets of the tokens, since
> > > the tokens from the original documents (doc1, doc2, doc3) were relative
> > > to those documents and the offsets may be wrong in newDocument?
> > > 3. If yes, how do I make sure that the merged tokens have correct
> > > start and end offsets?
> > >
> > > Thanks!
> > >
> >
>
