lucene-dev mailing list archives

From Chuck Williams <>
Subject Re: Define end-of-paragraph
Date Tue, 03 Oct 2006 21:39:18 GMT
Hi Reuven,

In my haste last night, I pointed you at the wrong fields on Token. You
need to set the position to create inter-paragraph gaps, not the
offsets, so you want Token.setPositionIncrement() for that approach, or
Analyzer.getPositionIncrementGap() if you use the multi-field approach.
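To make the position-gap idea concrete, here is a minimal plain-Java sketch. It does not call Lucene at all, since the Token API differs between versions; the class name ParagraphGapSketch, the PARAGRAPH_GAP value, and both helper methods are hypothetical. It accumulates positions the way position increments do, adds a large extra increment on the first token of each later paragraph, and uses a simplified distance check (not Lucene's actual slop calculation) to show why a small slop cannot match across a paragraph boundary.

```java
import java.util.ArrayList;
import java.util.List;

class ParagraphGapSketch {
    // Hypothetical gap; it only needs to exceed any slop used in proximity queries.
    static final int PARAGRAPH_GAP = 1000;

    /** Returns the position of every token, in order, across all paragraphs. */
    static List<Integer> positions(String[] paragraphs) {
        List<Integer> result = new ArrayList<Integer>();
        int position = -1; // so that the very first token lands on position 0
        for (String paragraph : paragraphs) {
            String trimmed = paragraph.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty paragraphs
            }
            String[] words = trimmed.split("\\s+");
            for (int w = 0; w < words.length; w++) {
                int increment = 1; // the normal increment between adjacent tokens
                if (w == 0 && !result.isEmpty()) {
                    increment += PARAGRAPH_GAP; // first token of a later paragraph
                }
                position += increment;
                result.add(position);
            }
        }
        return result;
    }

    /** Simplified proximity test: are two positions within `slop` of each other? */
    static boolean withinSlop(int posA, int posB, int slop) {
        return Math.abs(posA - posB) <= slop;
    }

    public static void main(String[] args) {
        String[] paragraphs = { "alpha beta gamma", "delta epsilon" };
        List<Integer> pos = positions(paragraphs);
        System.out.println(pos);                                      // [0, 1, 2, 1003, 1004]
        System.out.println(withinSlop(pos.get(0), pos.get(2), 10));   // true: same paragraph
        System.out.println(withinSlop(pos.get(2), pos.get(3), 10));   // false: crosses the gap
    }
}
```

In a real analyzer, the same extra increment would be applied through Token.setPositionIncrement() on the first token of each paragraph.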

You will likely have performance problems with Documents that have
thousands of fields, so I would not recommend that approach. Are you
only matching paragraphs rather than whole documents? If so, another
approach would be to make each paragraph a separate document. Then you
could store document and paragraph IDs in separate fields and have all
the information you want.
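As a rough illustration of the paragraph-per-document layout, the following plain-Java sketch splits one source document into records that each carry a document ID and a paragraph ID. Nothing here is Lucene API: the class and field names are hypothetical, and treating a blank line as the paragraph delimiter is purely an assumption. Each record would then be indexed as its own Lucene Document, with the two IDs stored in separate fields.

```java
import java.util.ArrayList;
import java.util.List;

class ParagraphDocumentSketch {
    /** One indexable unit: a single paragraph plus the IDs needed to locate it. */
    static final class ParagraphRecord {
        final String documentId;
        final int paragraphId;
        final String text;

        ParagraphRecord(String documentId, int paragraphId, String text) {
            this.documentId = documentId;
            this.paragraphId = paragraphId;
            this.text = text;
        }
    }

    /** Splits a document body on blank lines (an assumed delimiter). */
    static List<ParagraphRecord> split(String documentId, String body) {
        List<ParagraphRecord> records = new ArrayList<ParagraphRecord>();
        int paragraphId = 0;
        for (String chunk : body.split("\\n\\s*\\n")) {
            String text = chunk.trim();
            if (text.isEmpty()) {
                continue; // ignore leading or repeated blank lines
            }
            records.add(new ParagraphRecord(documentId, paragraphId, text));
            paragraphId++;
        }
        return records;
    }

    public static void main(String[] args) {
        List<ParagraphRecord> records = split("doc-1", "First paragraph.\n\nSecond paragraph.\n");
        for (ParagraphRecord r : records) {
            System.out.println(r.documentId + " / " + r.paragraphId + " / " + r.text);
        }
    }
}
```

With this layout a hit is already a paragraph, so no position arithmetic is needed to recover the paragraph number.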

If you need whole-document matching, but want the paragraph number of
matches, one approach might be to use SpanQueries together with a
position-encoding of paragraph numbers. E.g., place your paragraphs
starting at positions 0, 10000, 20000, 30000, ... Then from the
positions on the spans you find, you can identify what paragraph you are in.
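The arithmetic behind that encoding can be sketched in a few lines of plain Java. The class and method names are hypothetical and no Lucene calls are made; the stride of 10000 comes from the example above. The sketch assumes every paragraph has fewer than 10000 token positions, and it treats a span's end position as exclusive (one past the last matched token), which is an assumption to check against the Spans API of your Lucene version.

```java
class ParagraphPositionSketch {
    // Stride between paragraph start positions: 0, 10000, 20000, ...
    static final int PARAGRAPH_STRIDE = 10000;

    /** Position at which the given zero-based paragraph starts. */
    static int startPosition(int paragraphNumber) {
        return paragraphNumber * PARAGRAPH_STRIDE;
    }

    /** Zero-based paragraph number that contains the given token position. */
    static int paragraphOf(int position) {
        return position / PARAGRAPH_STRIDE;
    }

    /** True if a span [start, end) lies entirely inside one paragraph. */
    static boolean withinOneParagraph(int spanStart, int spanEndExclusive) {
        return paragraphOf(spanStart) == paragraphOf(spanEndExclusive - 1);
    }

    public static void main(String[] args) {
        System.out.println(startPosition(3));                 // 30000
        System.out.println(paragraphOf(20005));               // 2
        System.out.println(withinOneParagraph(20005, 20008)); // true
        System.out.println(withinOneParagraph(19998, 20002)); // false
    }
}
```

If some paragraphs might exceed the stride, a larger stride (or a stride derived from the longest paragraph) would be needed for the division to stay valid.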

I'm sure you can come up with many other ways to represent this
information as well.

Hope this helps,


Reuven Ivgi wrote on 10/02/2006 11:27 PM:
> Hello,
> To be more precise, the basic entity I am using is a document, each with
> up to a few thousand paragraphs. I need proximity search within a
> paragraph, yet I also want to get the paragraph number as part of the
> search result. Maybe defining each paragraph as a separate field is the
> best way.
> What do you think?
> Thanks in advance 
> Reuven Ivgi
> -----Original Message-----
> From: Chuck Williams [] 
> Sent: Tuesday, October 03, 2006 10:58 AM
> To:
> Subject: Re: Define end-of-paragraph
> Reuven Ivgi wrote on 10/02/2006 09:32 PM:
>> I want to divide a document into paragraphs, still having proximity
>> search within each paragraph.
>> How can I do that?
> Is your issue that you want the paragraphs to be in a single document,
> but you want to limit proximity search to find matches only within a
> single paragraph?  If so, you could parse your document into paragraphs
> and when generating tokens for it place large gaps at the paragraph
> boundaries.  Each Token in lucene has a startOffset and endOffset that
> you can set as you generate Tokens for the
> TokenStream returned by your Analyzer.  Those classes and methods are
> all in org.apache.lucene.analysis.  Or alternatively, you could make
> each paragraph a separate field value and use
> Analyzer.getPositionIncrementGap() to achieve essentially the same thing
> (except that your Documents could get unwieldy if you have that many
> paragraphs).
> If this is not what you are trying to do, then please explain your
> objectives precisely.
> Good luck,
> Chuck
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
