lucene-solr-user mailing list archives

From Alexandre Rafalovitch <arafa...@gmail.com>
Subject Re: Query with exact number of tokens
Date Fri, 21 Sep 2018 15:20:40 GMT
Hmm, I was suggesting putting a TokenCountingFilter at the end of both
the indexing and query chains for the same field (e.g. name_count). Then
the search would be something like (warning: untested, the syntax may
still be off):
.../select?
queryname=CENTURY BANCORP, INC&
q=*:*&
fq={!edismax qf=name mm=100% v=$queryname}&
fq={!complexphrase inOrder=true df=name_count v=$queryname}

So, the name_count would do the token match and it would allow for
synonyms of "INC" vs "INCORPORATED" as usual, if needed.

Regards,
   Alex.
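
For anyone who wants to try the request above from a script, here is a minimal sketch of how the parameters could be assembled and URL-encoded. It assumes the two filter queries are meant to dereference the shared `queryname` parameter with Solr's `$param` local-params syntax and that the eDisMax filter targets the `name` field through `qf`. The base URL and the `companies` core are placeholders, and `name_count` only works if the hypothetical TokenCountingFilter exists in that field's analysis chain. None of this has been run against a live Solr.

```python
from urllib.parse import urlencode


def build_params(company_name):
    """Return the request parameters as (key, value) pairs.

    A list of pairs is used instead of a dict because `fq` appears twice.
    """
    return [
        # Shared parameter that both filter queries dereference via $queryname.
        ("queryname", company_name),
        ("q", "*:*"),
        # Every query term must match in the `name` field.
        ("fq", "{!edismax qf=name mm=100% v=$queryname}"),
        # Assumption: name_count is analyzed with a custom token-counting filter.
        ("fq", "{!complexphrase inOrder=true df=name_count v=$queryname}"),
    ]


def build_url(base, company_name):
    """Join a Solr core URL (placeholder) with the encoded parameters."""
    return f"{base}/select?{urlencode(build_params(company_name))}"


if __name__ == "__main__":
    # Hypothetical host and core name, for illustration only.
    print(build_url("http://localhost:8983/solr/companies", "CENTURY BANCORP, INC"))
```

Encoding the parameters this way avoids having to escape the spaces, comma, braces and `%` by hand.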

On 21 September 2018 at 10:45, Erick Erickson <erickerickson@gmail.com> wrote:
> A variant on Alexandre's approach is:
> at index time, count the tokens that will be produced yourself (this
> may be a little tricky, you shouldn't have WordDelimiterFilterFactory
> in your analysis for instance).
> Put the number of tokens in a separate field
> At query time, you'd search q=+company_name:(+century +bancorp +inc)
> +tokens_in_company_name_field:3
>
> You don't need phrase queries with this approach, order doesn't matter.
>
> It can get tricky though, should "CENTURY BANCORP, INC." and "CENTURY
> BANCORP, INCORPORATED." match?
>
> Again, though, this means your indexing code has to do the same thing
> as your analysis chain. Which isn't very hard if the analysis chain is
> simple. I might use a char _filter_ factory to remove all
> non-alphanumeric characters, then a whitespace tokenizer and
> (probably) a LowerCaseFilter. That's pretty easy to replicate in order
> to count tokens.
>
> Best,
> Erick
> On Fri, Sep 21, 2018 at 7:18 AM Alexandre Rafalovitch
> <arafalov@gmail.com> wrote:
>>
>> I think you can match everything in the query to the field using either
>> 1) disMax/eDisMax with mm=100%
>> https://lucene.apache.org/solr/guide/7_4/the-dismax-query-parser.html#mm-minimum-should-match-parameter
>> 2) Complex Phrase Query Parser with inOrder=false:
>> https://lucene.apache.org/solr/guide/7_4/other-parsers.html#complex-phrase-query-parser
>>
>> The number of tokens though is hard. You only know what your tokens
>> are at the end of the indexing pipeline. And during search, the tokens
>> are looked up from their indexes and only then the documents are
>> looked up.
>>
>> You may be able to do this with a custom PostFilter that would run after
>> everything else to just reject records with extra tokens. That would
>> not be too expensive.
>>
>> Or (possibly a simpler way) you could try to precalculate things by
>> writing a custom TokenFilter that takes a stream and returns the token
>> count, to be used as a copyField target. Then you send your query to
>> the same field with any syntax that preserves the full query, either as a
>> phrase or with the field query parser:
>> https://lucene.apache.org/solr/guide/7_4/other-parsers.html#complex-phrase-query-parser
>>
>> I would love to know if any/all of this works for you.
>>
>> Regards,
>>    Alex.
>>
>> On 21 September 2018 at 09:00, marotosg <marotosg@gmail.com> wrote:
>> > Hi,
>> >
>> > I have to search for company names where my first requirement is to find
>> > only exact matches on the company name.
>> >
>> > For instance if I search for "CENTURY BANCORP, INC." I shouldn't find "NEW
>> > CENTURY BANCORP, INC."
>> > because the result company has the extra keyword "NEW".
>> >
>> > I can't use exact match because the sequence of tokens may differ. Basically
>> > I need to find results where the tokens are the same in any order and the
>> > number of tokens matches.
>> >
>> > I have no idea whether it's possible to include the number of tokens in the
>> > query and have a Solr field that holds that count to match against.
>> >
>> > Thanks for your help
>> > Sergio
>> >
>> >
>> >
>> > --
>> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
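
The index-time counting approach Erick describes earlier in the thread could be sketched roughly as follows. This is only an illustration under stated assumptions: the analysis chain is taken to be the simple one he mentions (strip non-alphanumeric characters, split on whitespace, lowercase), whitespace is kept while stripping so the tokenizer still has something to split on, and the `id` field and helper names are hypothetical. The field names `company_name` and `tokens_in_company_name_field` come from his example query.

```python
import re


def analyze(name):
    """Mimic the simple analysis chain on the client side.

    Removes everything except ASCII letters, digits and whitespace,
    then lowercases and splits on whitespace.
    """
    cleaned = re.sub(r"[^0-9A-Za-z\s]", "", name)
    return cleaned.lower().split()


def build_doc(doc_id, name):
    """Build a document to index, storing the token count in its own field."""
    return {
        "id": doc_id,  # hypothetical unique key field
        "company_name": name,
        "tokens_in_company_name_field": len(analyze(name)),
    }


def build_query(name):
    """Require every token (in any order) plus an exact token count."""
    tokens = analyze(name)
    required = " ".join(f"+{t}" for t in tokens)
    return f"+company_name:({required}) +tokens_in_company_name_field:{len(tokens)}"


if __name__ == "__main__":
    print(build_doc("1", "NEW CENTURY BANCORP, INC."))
    print(build_query("CENTURY BANCORP, INC."))
```

With this sketch, "NEW CENTURY BANCORP, INC." is indexed with a count of 4, so a query built from "CENTURY BANCORP, INC." (count 3) would not match it even though all three query tokens are present. It only stays correct as long as `analyze` produces the same tokens as the Solr field's real analysis chain.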
