lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Created] (LUCENE-6624) provide a BookendFilter to make the "exact match against an entire (tokenized) field value" usecase easy
Date Sat, 27 Jun 2015 01:03:04 GMT
Hoss Man created LUCENE-6624:
--------------------------------

             Summary: provide a BookendFilter to make the "exact match against an entire (tokenized)
field value" usecase easy
                 Key: LUCENE-6624
                 URL: https://issues.apache.org/jira/browse/LUCENE-6624
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Hoss Man


A question that seems to pop up every now and then is how to require an "exact match" against
"an entire field value" even while using some sort of analysis feature (ie: stopwords, or
lowercasing, or whitespace normalization).

In other words: instead of a literal, byte for byte, "exact match" (eg: {{new StringField(f,
val, Store.NO)}} at index time; {{new TermQuery(new Term(f, val))}} at query time) some folks
want to use some Tokenizer and TokenFilter but then require that a "PhraseQuery" (or SpanNearQuery)
on the input matches the entire field value, w/o any terms left over.

Example: they want (phrase) queries like {{"The Quick Brown Dog"}} and {{"quick BROWN dog"}}
to both match a document indexed with a field value "{{The Quick Brown Dog.}}" because their
analyzer tokenizes both the query & the field value into {{quick | brown | dog}} (standard
tokenizer + stopword & lowercase filters) -- BUT -- on the other hand they don't want
either of those phrase queries to match a document with a field value of "{{I Love the Quick
Brown Dog}}" because that field value includes additional terms not covered by the query.


A suggestion i've seen for years in response to this type of question is that folks can "inject
marker tokens" at the beginning and end of both the field values & query, and then (as
long as there is no "slop" on the phrase queries) they should get the matches they expect.
 The hackish way to do this is to just prepend and append some strings that won't be found
in their data and won't be stripped out by their tokenizer or any token filters (eg: {{new
TextField(f, "VAL_START_XYZABC " + val + " VAL_END_XYZABC", Store.NO)}} at index time; {{queryBuilder.createPhraseQuery(f,
"VAL_START_XYZABC " + val + " VAL_END_XYZABC")}} at query time).
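To make the marker-token idea concrete, here is a minimal sketch in plain Java that deliberately avoids the Lucene API. Everything in it is an assumption for illustration: the class {{MarkerTokenSketch}}, the toy {{analyze}} method (split on non-word characters, lowercase, drop a tiny stopword list), and the modeling of a zero-slop phrase query as a contiguous sub-list match. It also ignores the position gaps that a real analysis chain may leave where stopwords were removed, so it shows the intent of the trick rather than exact Lucene behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class MarkerTokenSketch {
    // Marker strings copied from the example in the text above.
    static final String START = "VAL_START_XYZABC";
    static final String END = "VAL_END_XYZABC";

    // Toy stopword list (assumption); a real analyzer would supply its own.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "a", "an"));

    // Toy stand-in for an analyzer: split on anything that is not a letter,
    // digit, or underscore, lowercase each piece, and drop stopwords.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^A-Za-z0-9_]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // The "hackish" approach: wrap the raw value in marker strings before analysis.
    static List<String> analyzeWithMarkers(String text) {
        return analyze(START + " " + text + " " + END);
    }

    // Model of a phrase query with no slop: the query tokens must appear
    // as one contiguous run inside the field's tokens.
    static boolean phraseMatches(List<String> fieldTokens, List<String> queryTokens) {
        return Collections.indexOfSubList(fieldTokens, queryTokens) >= 0;
    }
}
```

In this model, {{phraseMatches(analyze("I Love the Quick Brown Dog"), analyze("quick BROWN dog"))}} is true, which is the unwanted match. Switching both sides to {{analyzeWithMarkers}} makes that case false while "The Quick Brown Dog." still matches, because the markers pin the phrase to both ends of the field value.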


Unless i'm missing something, it should be fairly trivial to write a "BookendFilter" that
does this automatically for users:

* the first time {{incrementToken()}} is called, produce a synthetic "start" token with some
CharTermAttribute that uses a non-printing unicode sequence (overridable by user config)
* after that, all calls to {{incrementToken()}} proxy to the wrapped stream until it's exhausted
* after that, when {{incrementToken()}} is called, produce a synthetic "end" token with some
CharTermAttribute that uses a non-printing unicode sequence (overridable by user config)
* both synthetic tokens should have KeywordAttribute == true

...At index time the synthetic tokens will be indexed as terms, and if the same analyzer is
used at query time to build a PhraseQuery those terms will be the first and last terms in
the PhraseQuery.
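
The token sequence described in the list above can be modeled as a small state machine. The sketch below is plain Java and is not a Lucene {{TokenFilter}}: the class name {{BookendIterator}}, the helper {{bookend}}, and the default marker characters (U+0002 and U+0003) are all assumptions chosen for illustration. It models only the ordering of tokens; KeywordAttribute, offsets, and position increments are left out.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Wraps a token iterator and emits: start marker, wrapped tokens, end marker.
final class BookendIterator implements Iterator<String> {
    // Hypothetical defaults: non-printing control characters (STX / ETX).
    static final String DEFAULT_START = "\u0002";
    static final String DEFAULT_END = "\u0003";

    private enum State { START, BODY, DONE }

    private final Iterator<String> wrapped;
    private final String startToken;
    private final String endToken;
    private State state = State.START;

    BookendIterator(Iterator<String> wrapped) {
        this(wrapped, DEFAULT_START, DEFAULT_END);
    }

    // Marker tokens are overridable, mirroring the "user config" idea.
    BookendIterator(Iterator<String> wrapped, String startToken, String endToken) {
        this.wrapped = wrapped;
        this.startToken = startToken;
        this.endToken = endToken;
    }

    @Override
    public boolean hasNext() {
        // The end marker is always still pending until DONE is reached.
        return state != State.DONE;
    }

    @Override
    public String next() {
        switch (state) {
            case START:
                // First call: emit the synthetic start token.
                state = State.BODY;
                return startToken;
            case BODY:
                // Proxy to the wrapped stream until it is exhausted...
                if (wrapped.hasNext()) {
                    return wrapped.next();
                }
                // ...then emit the synthetic end token exactly once.
                state = State.DONE;
                return endToken;
            default:
                throw new NoSuchElementException();
        }
    }

    // Convenience helper: bookend a whole token list with the default markers.
    static List<String> bookend(List<String> tokens) {
        List<String> out = new ArrayList<>();
        new BookendIterator(tokens.iterator()).forEachRemaining(out::add);
        return out;
    }
}
```

Under these assumptions, {{bookend}} applied to the tokens {{quick | brown | dog}} yields five tokens with the markers first and last, and an empty token list still yields the two markers. A real filter would additionally need to cooperate with the attribute and {{end()}} / {{reset()}} contracts of the wrapped stream.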



