lucene-solr-user mailing list archives

From Alexandre Rafalovitch <>
Subject Re: Query with exact number of tokens
Date Fri, 21 Sep 2018 15:20:40 GMT
Hmm, I was suggesting to put TokenCountingFilter at the end of both
indexing and query chains for the same (e.g. name_count) field. Then,
the search would be something like (warning, untested syntax; queryname
is a request parameter holding the company name):
fq={!edismax qf=name mm=100% v=$queryname}&
fq={!complexphrase inOrder=true df=name_count v=$queryname}

So, the name_count would do the token match and it would allow for
synonyms of "INC" vs "INCORPORATED" as usual, if needed.
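
To make the idea concrete, here is a minimal sketch in plain Python. It
is not Solr or Lucene code, only a simulation of the matching logic. It
assumes the simple chain Erick describes in the quoted message below
(strip non-alphanumerics, split on whitespace, lowercase), handles ASCII
only, and ignores synonyms. The names analyze, name_count and matches
are hypothetical, chosen for illustration.

```python
import re


def analyze(text):
    """Assumed simple analysis chain: drop characters that are neither
    ASCII alphanumeric nor whitespace, split on whitespace, lowercase."""
    cleaned = re.sub(r"[^0-9A-Za-z\s]", "", text)
    return [token.lower() for token in cleaned.split()]


def name_count(text):
    """Stand-in for a counting filter at the end of the chain:
    the whole input collapses to a single value, its token count."""
    return len(analyze(text))


def matches(query, indexed_name):
    """True when every query token occurs in the indexed name (the
    mm=100% part) and both sides yield the same token count (the
    name_count part). Token order is ignored."""
    doc_tokens = analyze(indexed_name)
    all_present = all(token in doc_tokens for token in analyze(query))
    return all_present and name_count(query) == name_count(indexed_name)
```

With this sketch, matches("CENTURY BANCORP, INC.", "Inc Bancorp Century")
is true, while a made-up name with one extra token, such as
"NEW CENTURY BANCORP INC", is rejected because the counts differ.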


On 21 September 2018 at 10:45, Erick Erickson <> wrote:
> A variant on Alexandre's approach is:
> at index time, count the tokens that will be produced yourself (this
> may be a little tricky, you shouldn't have WordDelimiterFilterFactory
> in your analysis for instance).
> Put the number of tokens in a separate field
> At query time, you'd search q=+company_name:(+century +bancorp +inc)
> +tokens_in_company_name_field:3
> You don't need phrase queries with this approach; order doesn't matter.
> It can get tricky though, should "CENTURY BANCORP, INC." and "CENTURY
> Again, though, this means your indexing code has to do the same thing
> as your analysis chain. Which isn't very hard if the analysis chain is
> simple. I might use a char _filter_ factory to remove all
> non-alphanumeric characters, then a whitespace tokenizer and
> (probably) a lowercasefilter. That's pretty easy to replicate in order
> to count tokens.
> Best,
> Erick
> On Fri, Sep 21, 2018 at 7:18 AM Alexandre Rafalovitch
> <> wrote:
>> I think you can match everything in the query to the field using either
>> 1) disMax/eDisMax with mm=100%
>> 2) Complex Phrase Query Parser with inOrder=false:
>> The number of tokens though is hard. You only know what your tokens
>> are at the end of the indexing pipeline. And during search, the tokens
>> are looked up from their indexes and only then the documents are
>> looked up.
>> You may be able to do this with a custom PostFilter that would run after
>> everything else to just reject records with extra tokens. That would
>> not be too expensive.
>> Or (possibly a simpler way) you could try to precalculate things, by
>> writing a custom TokenFilter that takes a stream and returns the token
>> count to be used as a copyField target. Then you send your query to
>> the same field with any full-query preserving syntax, either as a
>> phrase or as a field query parser:
>> I would love to know if any/all of this works for you.
>> Regards,
>>    Alex.
>> On 21 September 2018 at 09:00, marotosg <> wrote:
>> > Hi,
>> >
>> > I have to search for company names where my first requirement is to find
>> > only exact matches on the company name.
>> >
>> > For instance if I search for "CENTURY BANCORP, INC." I shouldn't find "NEW
>> > because the result company has the extra keyword "NEW".
>> >
>> > I can't use exact match because the sequence of tokens may differ. Basically
>> > I need to find results where the tokens are the same in any order and the
>> > number of tokens matches.
>> >
>> > I have no idea if it's possible to include the number of tokens in the
>> > query and have a Solr field that holds that info to match against.
>> >
>> > Thanks for your help
>> > Sergio
