lucene-java-user mailing list archives

From Paul Allan Hill
Subject RE: Phrase Queries vs. SpanTermQueries exact phrases vs. stop words
Date Wed, 01 Feb 2012 19:04:31 GMT
Thanks for the discussion. I really appreciate you pointing out that the

> Code here ignores PhraseQuery (PQ)'s positions:

And by "here" you mean my original code, not your suggestion.

> To accommodate for this, the overall extra gap can be added to the slop:
>     int gap = (pp[pp.length - 1] - pp[0]) - (pp.length - 1);  // (+/- boundary cases)
>     slop += gap;
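
A minimal sketch of the quoted endpoint calculation, assuming `pp` holds the phrase positions in ascending order (the kind of array PhraseQuery.getPositions() returns). The class and method names are hypothetical, and the two guards (single-term phrase, negative result) are only guesses at what "boundary cases" might cover:

```java
final class EndpointGap {
    /**
     * Extra slop implied by holes in the phrase positions, using only the
     * first and last position. Assumes pp is sorted ascending.
     */
    static int endpointGap(int[] pp) {
        if (pp.length < 2) {
            return 0; // zero or one term: no internal holes (assumed boundary case)
        }
        int span = pp[pp.length - 1] - pp[0]; // distance the phrase covers
        int expected = pp.length - 1;         // distance if positions were consecutive
        // Clamp at zero: terms stacked at one position can make the
        // difference negative (assumed boundary case).
        return Math.max(0, span - expected);
    }

    public static void main(String[] args) {
        int[] pp = {1, 3, 4};              // hypothetical positions with one hole
        int slop = 2 + endpointGap(pp);    // user slop of 2 plus the gap
        System.out.println(slop);
    }
}
```

With positions {1, 3, 4} the gap is 1, so a user slop of 2 becomes 3.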

At first I was thinking my refinement of this would be to consider the original slop provided
by the user and only extend it when necessary.
For example:
"The Importance of Being Earnest"~2
already has enough slop to take into consideration the stop words 'the' and 'of', so no need
to just add more to the slop.
But a slop of 2 really means the user would accept
[The Importance of Really Truly Being Earnest], and I see that requires a slop of 3 to skip
[of] [Really] [Truly].
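
The arithmetic in this example can be checked with a small sketch that counts the tokens a match has to skip between the first and last query term. It assumes whitespace tokenization, lowercasing, and that 'the' and 'of' have already been dropped from the query; the class and method names are hypothetical, and this only illustrates the counting, not how Lucene scores slop internally:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SkippedTokens {
    /**
     * Counts tokens lying between the first and last query term in the text
     * that are not query terms themselves. Assumes each query term occurs
     * at most once in the text.
     */
    static int skipped(String text, String... queryTerms) {
        Set<String> wanted = new HashSet<>(Arrays.asList(queryTerms));
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        int first = -1;
        int last = -1;
        for (int i = 0; i < tokens.length; i++) {
            if (wanted.contains(tokens[i])) {
                if (first < 0) first = i;
                last = i;
            }
        }
        if (first < 0) return 0; // no query term present
        int count = 0;
        for (int i = first + 1; i < last; i++) {
            if (!wanted.contains(tokens[i])) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String[] q = {"importance", "being", "earnest"}; // phrase after stop-word removal
        System.out.println(skipped("The Importance of Being Earnest", q));
        System.out.println(skipped("The Importance of Really Truly Being Earnest", q));
    }
}
```

The exact title needs one skipped token ([of]) and the longer text needs three ([of] [Really] [Truly]), which is the user's slop of 2 plus the one hole left by the stop word.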

But I'm not sure I understand the 'edit distance' for a phrase with more than 2 words.
Does it apply to _all_the_edits_combined_ that are needed to bring the quoted phrase to match
the indexed phrase, as suggested by your calculation?

Also, do any "boundary cases" (as mentioned in your comment) come to mind?

> Also, this code suggestion simplifies in the case that the analyzer in effect may emit
> more than one term at the same position - for example when expanding the query with
> synonyms, or when
> originals and stemmed forms - in that case just comparing pp[0] and pp[pp.length-1] is
> and the positions should be examined while looping the phrase terms, something like this:

I don't understand what you mean by "simplifies", since you already gave the simplification
in your first example, which I think would work with or without synonyms, so there is no need
to walk through each distance as shown in your later code.

>    int dpos = pp[i] - pp[i-1]; // (i>0)
>    if (dpos > 1)
>        slop += (dpos - 1);
> Haven't tested this - just to give you an idea what to try next.
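
A minimal sketch of the per-term loop from the quoted fragment, assuming `pp` is sorted ascending and that terms stacked at one position (for example a synonym next to its original) appear as repeated values. The class and method names are hypothetical, and this is untested against a real analyzer:

```java
final class PerTermGap {
    /**
     * Sums the holes between consecutive phrase positions. A difference of 0
     * (stacked terms) or 1 (adjacent terms) adds nothing.
     */
    static int extraSlop(int[] pp) {
        int extra = 0;
        for (int i = 1; i < pp.length; i++) {
            int dpos = pp[i] - pp[i - 1];
            if (dpos > 1) {
                extra += dpos - 1; // one unit of slop per missing position
            }
        }
        return extra;
    }

    public static void main(String[] args) {
        int[] stacked = {1, 1, 3, 4}; // hypothetical: two terms share position 1
        System.out.println(extraSlop(stacked));
    }
}
```

For {1, 1, 3, 4} the loop finds one hole, whereas the endpoint formula gives (4 - 1) - (4 - 1) = 0, because the stacked term inflates the term count. Without stacked terms, the two calculations agree.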

Thanks for your input. I will experiment with some code that takes the original PQ positions
into account when computing the slop value of any generated SpanNearQuery.

