lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Design questions
Date Thu, 24 Jan 2008 19:55:33 GMT
I think you'll have to implement your own Analyzer and count.
That is, every call to next() that returns a token will have to
also increment some counter by 1.
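The counting idea can be sketched roughly as follows. This is plain Java, not Lucene code: TokenSource, WhitespaceSource and CountingSource are hypothetical stand-ins that only mimic the "next() returns a token or null" contract described above, and whitespace splitting stands in for whatever your real analyzer does.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical stand-in for a token stream: next() returns null when exhausted. */
interface TokenSource {
    String next();
}

/** Splits on whitespace; a placeholder for the real tokenizer (assumption). */
final class WhitespaceSource implements TokenSource {
    private final Iterator<String> it;

    WhitespaceSource(String text) {
        String trimmed = text.trim();
        String[] parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        this.it = Arrays.asList(parts).iterator();
    }

    @Override
    public String next() {
        return it.hasNext() ? it.next() : null;
    }
}

/** Wraps another source and counts every token that next() actually returns. */
final class CountingSource implements TokenSource {
    private final TokenSource inner;
    private int count;

    CountingSource(TokenSource inner) {
        this.inner = inner;
    }

    @Override
    public String next() {
        String token = inner.next();
        if (token != null) {
            count++; // only real tokens are counted, not the end-of-stream null
        }
        return token;
    }

    int getCount() {
        return count;
    }
}
```

In a real analyzer the same wrapper shape would presumably sit around the TokenStream your analyzer returns, with the counter kept somewhere you can read it after each page.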

To use this, you must have some way of knowing when a page
ends, and at that point you call your instance of your custom
analyzer to see what the count is. Or your analyzer maintains
the list and you can call for it after you've added all the pages.

Analyzer.getPositionIncrementGap is called once for every
doc.add("field", ...) after the first one for that field name,
so it marks each page boundary.

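So, you have something like this:
while (hasMorePages()) {   // hasMorePages(): placeholder for your own page loop
   String pageData = getPageText();
   // one Field per page, all under the same field name (Lucene 2.x constants)
   doc.add(new Field("text", pageData, Field.Store.NO, Field.Index.TOKENIZED));
}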

Under the covers, your custom analyzer adds the current offset
(which you've kept track of) to, say, an ArrayList. And after the
last page is added, you get this ArrayList and add it to your
document.
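As a rough illustration of the bookkeeping, here is a minimal sketch in plain Java. PageOffsetRecorder and its methods are hypothetical names, whitespace splitting stands in for the real analyzer, and it assumes a position increment gap of 0 between pages, so treat it as an outline of the idea rather than working Lucene code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: records the token offset at which each page starts. */
final class PageOffsetRecorder {
    private final List<Integer> pageStarts = new ArrayList<>();
    private int tokensSoFar;

    /** Call once per page, in order, alongside the doc.add(...) for that page. */
    void addPage(String pageText) {
        pageStarts.add(tokensSoFar); // token offset where this page begins
        String trimmed = pageText.trim();
        if (!trimmed.isEmpty()) {
            // Assumption: whitespace tokens approximate what the analyzer emits.
            tokensSoFar += trimmed.split("\\s+").length;
        }
    }

    List<Integer> getPageStarts() {
        return Collections.unmodifiableList(pageStarts);
    }

    /**
     * Zero-based index of the page containing the given token position,
     * or -1 for a negative position. Empty pages share a start offset,
     * so a lookup that lands on one of them is ambiguous.
     */
    int pageFor(int tokenPosition) {
        int idx = Collections.binarySearch(pageStarts, tokenPosition);
        // Not found: binarySearch returns -(insertionPoint) - 1, and the
        // containing page is the one just before the insertion point.
        return idx >= 0 ? idx : -idx - 2;
    }
}
```

The recorded list could then be written into the document, for example as a space-separated string in a stored, unindexed field, and read back at search time to map a hit position to its page.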

Or, you could just do things twice. That is, send your text through
a TokenStream, then call next() and count. Then send it all
through doc.add().

There are probably cleverer ways, but that should do for a start.

Best
Erick

On Jan 24, 2008 2:33 PM, <spring@gmx.eu> wrote:

> > -----Original Message-----
> > From: Erick Erickson [mailto:erickerickson@gmail.com]
> > Sent: Freitag, 11. Januar 2008 16:16
> > To: java-user@lucene.apache.org
> > Subject: Re: Design questions
>
> > But you could also vary this scheme by simply storing in your document
> > the offsets for the beginning of each page.
>
> Well, this is the best for my app I think, but...
>
> How do I find out these offsets?
>
> I'm adding the content field with:
>
> IndexWriter#add(new Field("content", myContentReader));
>
> I have no clue how to find out the offsets in this reader. Must be something
> with an analyzer and a TokenStream?
>
> Thank you