lucene-java-user mailing list archives

From <spr...@gmx.eu>
Subject RE: Design questions
Date Thu, 24 Jan 2008 21:10:39 GMT
OK, I will give this a try.

Now I have the problem that I do not know how to get the offsets (or
positions? What is the difference?) back from the searched document...

There is an IndexReader#termPositions(Term t), but this returns the
positions for the whole index, not for a single document.
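
On the positions-versus-offsets question: in Lucene's token model, a
position is the ordinal of a token within the field, while the start and
end offsets are character indexes into the original text. Below is a
minimal sketch of that distinction. It uses a plain whitespace tokenizer
as a stand-in for a real Analyzer, and the class and method names are made
up for illustration (they are not Lucene API).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: a whitespace tokenizer standing in for a Lucene Analyzer. */
class TokenInfoDemo {

    /** One token with its position (token ordinal) and its character offsets. */
    static final class TokenInfo {
        final String text;
        final int position;     // 0-based index of the token within the field
        final int startOffset;  // index of the token's first character in the text
        final int endOffset;    // index one past the token's last character

        TokenInfo(String text, int position, int startOffset, int endOffset) {
            this.text = text;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Splits on whitespace and records both the position and the offsets of each token. */
    static List<TokenInfo> tokenize(String text) {
        List<TokenInfo> tokens = new ArrayList<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token itself.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(new TokenInfo(text.substring(start, i), position++, start, i));
            }
        }
        return tokens;
    }
}
```

For the text "quick brown fox", the token "fox" has position 2 but
offsets 12 to 15. Counting tokens per page gives positions, and counting
characters gives offsets.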



> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com] 
> Sent: Thursday, 24 January 2008 20:56
> To: java-user@lucene.apache.org
> Subject: Re: Design questions
> 
> I think you'll have to implement your own Analyzer and count.
> That is, every call to next() that returns a token will have to
> also increment some counter by 1.
> 
> To use this, you must have some way of knowing when a page
> ends, and at that point you call your instance of your custom
> analyzer to see what the count is. Or your analyzer maintains
> the list and you can call for it after you've added all the pages.
> 
> Analyzer.getPositionIncrementGap is called every time you
> call document.add("field".....
> 
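> So, you have something like this:
> // morePagesForDoc() and getPageText() are placeholders for your own code
> while (morePagesForDoc()) {
>    String pageData = getPageText();
>    doc.add(new Field("text", pageData, Field.Store.NO, Field.Index.TOKENIZED));
> }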
> 
> Under the covers, your custom analyzer adds the current offset
> (which you've kept track of) to, say, an ArrayList. And after the
> last page is added, you get this arraylist and add it to your
> document.
> 
> Or, you could just do things twice. That is, send your text through
> a TokenStream, then call next() and count. Then send it all
> through doc.add().
> 
> There are probably cleverer ways, but that should do for a start.
> 
> Best
> Erick
> 
> On Jan 24, 2008 2:33 PM, <spring@gmx.eu> wrote:
> 
> > > -----Original Message-----
> > > From: Erick Erickson [mailto:erickerickson@gmail.com]
> > > Sent: Friday, 11 January 2008 16:16
> > > To: java-user@lucene.apache.org
> > > Subject: Re: Design questions
> >
> > > But you could also vary this scheme by simply storing in
> > > your document the offsets for the beginning of each page.
> >
> > Well, this is the best for my app I think, but...
> >
> > How do I find out these offsets?
> >
> > I'm adding the content field with:
> >
> > Document#add(new Field("content", myContentReader));
> >
> > I have no clue how to find out the offsets in this reader.
> > Must be something with an analyzer and a TokenStream?
> >
> > Thank you
> >
> >
> > 
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> 
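
The page-start bookkeeping described in the quoted reply can be sketched
without any Lucene classes. This is only a rough illustration under some
assumptions: whitespace-separated words stand in for whatever tokens the
analyzer would emit, the `gap` parameter mimics a position increment gap
between pages, and every name below is hypothetical. The exact position
arithmetic in a real index depends on the analyzer and the Lucene version.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: records the starting token position of each page. */
class PageStartTracker {
    private final List<Integer> pageStarts = new ArrayList<>();
    private final int gap;        // positions skipped between consecutive pages
    private int nextPosition = 0; // position the next token would receive

    PageStartTracker(int gap) {
        this.gap = gap;
    }

    /** Records where this page starts, then advances by the page's token count. */
    void addPage(String pageText) {
        if (!pageStarts.isEmpty()) {
            nextPosition += gap;
        }
        pageStarts.add(nextPosition);
        nextPosition += countTokens(pageText);
    }

    /** Counts whitespace-separated tokens; a real analyzer may count differently. */
    static int countTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Returns a copy of the recorded page start positions. */
    List<Integer> pageStarts() {
        return new ArrayList<>(pageStarts);
    }

    /**
     * Returns the 0-based index of the last page whose start is at or before
     * the given position, or -1 if the position precedes every page.
     */
    int pageForPosition(int position) {
        int lo = 0;
        int hi = pageStarts.size() - 1;
        int page = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (pageStarts.get(mid) <= position) {
                page = mid;   // candidate; look for a later page that still qualifies
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return page;
    }
}
```

With a gap of 0, adding the pages "a b c", "d e" and "f g h i" records the
page starts [0, 3, 5], so a hit at position 4 maps to page index 1. The
list of starts is what would be stored with the document so that a hit
position can be mapped back to a page at search time.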



