lucene-dev mailing list archives

From Chuck Williams
Subject Re: Strange behavior of positionIncrementGap
Date Fri, 11 Aug 2006 19:42:34 GMT

Chris Hostetter wrote on 08/11/2006 09:08 AM:
> (using lower case
> to indicate no tokens produced and upper case to indicate tokens were
> produced) ...
> 1) a b C _gap_ D             ...results in:  C _gap_ D
> 2) a B _gap_ C _gap_ D       ...results in:  B _gap_ C _gap_ D
> 3) A _gap_ b _gap_ c _gap_ D ...results in:  A _double_gap_ D
> that the behavior you are seeing?
Almost.  The only difference is that case 3 has 3 gaps, so it's A
_triple_gap_ D.
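
A minimal sketch of the behavior described so far, not the actual indexing code. It assumes a rule inferred only from the three cases above: a gap is emitted before a field value only once some earlier value has produced a token. The function name `render_current` is hypothetical, and the upper/lower-case convention is the one Hoss uses.

```python
def render_current(values):
    """Render the token/gap sequence for a list of field values.

    Upper-case names stand for values that produce tokens,
    lower-case names for values that produce none.
    """
    out = []
    seen_token = False
    for name in values:
        # Assumed rule: a gap precedes a value only after an earlier
        # value has already produced at least one token.
        if seen_token:
            out.append("_gap_")
        if name.isupper():
            out.append(name)
            seen_token = True
    return " ".join(out)


print(render_current("a b C D".split()))  # C _gap_ D
print(render_current("a B C D".split()))  # B _gap_ C _gap_ D
print(render_current("A b c D".split()))  # A _gap_ _gap_ _gap_ D
```

Under this assumed rule, case 3 comes out with three gaps between A and D, matching the correction above.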
> Only case #3 seems "wrongish" to me there. ... i started to explain why i
> thought it made sense to go ahead and "fix this", where by fix i meant only
> insert one gap in case #3 ... and then realized i was actually arguing in
> favor of the current behavior for case #3, here is why...
>    based on the semi-frequently discussed usage of token gap sizes to
>    denote sentence/paragraph/page boundaries for the purpose of sloppy
>    phrase queries, it certainly seems worthwhile to fix to me (so that
>    queries like "find Erik within 3 pages of Otis" still work even if one
>    of those pages is blank ...
> ...that's when i realized the current behavior of case #3 is actually
> important for accurate matching, otherwise a search for two words within a
> certain number of pages would have a false match if those pages were
> blank.  case #1 seems fine, but case #2 seems like the "wrong" case to me
> now, because trying to find occurrences of "B" on page #1 using a
> SpanFirst query will have false positives ... it seems like the
> positionIncrementGap should always be called/used after any field value is
> added (even if the value results in no tokens) before the next value is
> added (even if that value results in no tokens)
> Does this jibe with what you were expecting, and the patch you were
> considering?
Precisely.  The same concern about SpanFirstQuery also applies to case
1.  My bulk update code was always generating the positionIncrementGap
between all field values, so if there are 4 values it would always
generate 3 gaps independent of whether or not the values generate
tokens.  For your cases it generated:

1) a b C D ...results in:  _gap_ _gap_ C _gap_ D
2) a B C D ...results in:  _gap_ B _gap_ C _gap_ D
3) A b c D ...results in:  A _gap_ _gap_ _gap_ D

This seems a natural behavior and is consistent with the use cases you
describe (which are essentially the same reason I'm using gaps, and
presumably the main purpose of gaps).
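
To make the SpanFirstQuery concern concrete, here is a simplified position model, offered as a sketch rather than as how the indexer actually computes positions. The assumptions: each token-producing value yields exactly one token, the gap is an arbitrary 100, and "page 1" means positions below the gap. The function `token_positions` and its `always_gap` flag are hypothetical names used only for this illustration.

```python
def token_positions(values, gap, always_gap):
    """Return {name: position} for the values that produce a token.

    always_gap=True  : add the gap between every pair of field values.
    always_gap=False : add the gap only after some earlier value has
                       produced a token (the behavior discussed above).
    """
    positions = {}
    position = 0  # position the next token would receive
    seen_token = False
    for index, name in enumerate(values):
        if index > 0 and (always_gap or seen_token):
            position += gap
        if name.isupper():  # upper case: this value produces one token
            positions[name] = position
            position += 1
            seen_token = True
    return positions


case2 = "a B C D".split()
print(token_positions(case2, gap=100, always_gap=False))
# {'B': 0, 'C': 101, 'D': 202}
print(token_positions(case2, gap=100, always_gap=True))
# {'B': 100, 'C': 201, 'D': 302}
```

In this model, B is the second value, yet without the unconditional gap it lands at position 0, inside the range of the first value. With a gap between all values it lands at 100, so a query restricted to the first value's range no longer matches it.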

Hoss, do you think it would be ok to fix given the potential upward
incompatibility for index-format-dependent implementations?

