lucene-java-user mailing list archives

From: Doug Cutting
Subject: Re: positional token info
Date: Tue, 21 Oct 2003 16:53:55 GMT

Erik Hatcher wrote:
> Just for fun, I've written a simple stop filter that bumps the position 
> increments to account for the stop words removed:
> But it's practically impossible to formulate a Query that can take 
> advantage of this.  A PhraseQuery, because Terms don't have positional 
> info (only the transient tokens), only works using a slop factor which 
> doesn't guarantee an exact match like I'm after.  A PhrasePrefixQuery 
> won't work any better as there is no way to add in a "blank" term to 
> indicate a missing position.

The PhraseQuery code predates the setPositionIncrement feature.

Your filter does help for phrases that don't contain stop words.  E.g., 
when your filter is used, a query for the phrase "phone boy" won't match 
"phone the boy", as it would with the normal stop filter.  But a query 
for "phone the boy" is still reduced to the adjacent terms "phone boy", 
so it too would only match "phone boy".

One workaround is to simply not use a stop list.  Then "phone boy" will 
only match "phone boy", and "phone the boy" will only match "phone the 
boy", and not "phone a boy" too.  One can write a query parser which 
removes stop words unless they're in phrases.  This is what Nutch and 
Google do.
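
As a rough illustration of that parsing rule (a toy sketch, not Nutch's 
or Lucene's actual query parser; the class name, the stop list and the 
clause format are all made up for this example), one could split the 
query on double quotes and only filter the text outside them:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: drop stop words from a query unless they are inside a quoted phrase. */
class StopAwareQuerySketch {
    // Assumed stop list, for illustration only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "the", "of"));

    /** Returns one clause per surviving bare word, plus each quoted phrase kept whole. */
    static List<String> parse(String query) {
        List<String> clauses = new ArrayList<>();
        // Splitting on '"' puts quoted text at the odd indexes. With an
        // unbalanced quote, everything after it is treated as a phrase.
        String[] parts = query.split("\"", -1);
        for (int i = 0; i < parts.length; i++) {
            String part = parts[i].trim().toLowerCase();
            if (part.isEmpty()) {
                continue;
            }
            if (i % 2 == 1) {
                // Inside quotes: keep every word, stop words included.
                clauses.add("\"" + part + "\"");
            } else {
                // Outside quotes: drop stop words.
                for (String word : part.split("\\s+")) {
                    if (!STOP_WORDS.contains(word)) {
                        clauses.add(word);
                    }
                }
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(parse("the phone \"phone the boy\" a boy"));
    }
}
```

The index itself is built without a stop list here, so the phrase 
clause can be matched exactly, stop words and all.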

If however you want "phone the boy" to match "phone X boy" where X is 
any word, then PhraseQuery would have to be extended.  It's actually a 
pretty simple extension.  Each term in a PhraseQuery corresponds to a 
PhrasePositions object.  The 'offset' field within this is the position 
of the term in the phrase.  If you construct the phrase positions for a 
two-term phrase so that the first has offset=0 and the second offset=2, 
then you'll get this sort of matching.  So all that's needed is a new 
method PhraseQuery.add(Term term, int offset), and for these offsets to 
be stored so that they can be used when building PhrasePositions.  Would 
this be a useful feature?
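
To make the proposal concrete, here is a small self-contained model of 
the matching rule.  It is not Lucene code: the class and method names 
are invented for illustration, and a plain map of term positions stands 
in for the index and for PhrasePositions.  The indexing step mimics a 
stop filter that bumps position increments, so a removed stop word 
still uses up a position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of phrase matching with per-term offsets; not Lucene code. */
class OffsetPhraseSketch {
    // Assumed stop list, for illustration only.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "the"));

    /** Maps each kept term to its positions; a removed stop word still consumes a position. */
    static Map<String, List<Integer>> index(String text, boolean removeStopWords) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int position = 0;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!(removeStopWords && STOP_WORDS.contains(token))) {
                positions.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
            }
            position++;
        }
        return positions;
    }

    /** True if some start position s has terms[i] at s + offsets[i] for every i. */
    static boolean matches(Map<String, List<Integer>> doc, String[] terms, int[] offsets) {
        List<Integer> first = doc.get(terms[0]);
        if (first == null) {
            return false;
        }
        for (int p : first) {
            int start = p - offsets[0];
            boolean all = true;
            for (int i = 1; i < terms.length && all; i++) {
                List<Integer> candidates = doc.get(terms[i]);
                all = candidates != null && candidates.contains(start + offsets[i]);
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] terms = {"phone", "boy"};
        Map<String, List<Integer>> doc = index("phone the boy", true);
        // Offsets 0 and 2 leave a one-word gap between the two terms.
        System.out.println(matches(doc, terms, new int[] {0, 2}));
        // Offsets 0 and 1 require the two terms to be adjacent.
        System.out.println(matches(doc, terms, new int[] {0, 1}));
    }
}
```

With offsets 0 and 2 the phrase matches "phone the boy" and "phone a 
boy" but not "phone boy"; with offsets 0 and 1 it is the other way 
around.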

