lucene-solr-user mailing list archives

From Jan Høydahl <jan....@cominvent.com>
Subject Re: Query with exact number of tokens
Date Fri, 21 Sep 2018 17:04:27 GMT
I have made a FieldType specifically for this:
https://github.com/cominvent/exactmatch/

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 21 Sep 2018, at 18:14, Steve Rowe <sarowe@gmail.com> wrote:
> 
> Link correction - wrong fragment identifier in ref #5 - should be:
> 
> [5] https://lucene.apache.org/solr/guide/7_4/other-parsers.html#function-range-query-parser
> 
> --
> Steve
> www.lucidworks.com
> 
>> On Sep 21, 2018, at 12:04 PM, Steve Rowe <sarowe@gmail.com> wrote:
>> 
>> Hi Sergio,
>> 
>> Chris “Hoss” Hostetter has a solution to this kind of problem here:
>> https://lists.apache.org/thread.html/6b0f0cb864aa55f0a9eadfd92d27d374ab8deb16e8131ed2b7234463@%3Csolr-user.lucene.apache.org%3E
>> See also the suggestions in the comments on SOLR-12673[1], which include a version of Hoss’s solution.
>> 
>> Hoss’s solution assumes a multivalued StrField with values counted using CountFieldValuesUpdateProcessorFactory,
>> which doesn’t apply to you. You could instead count unique tokens in an analyzed field
>> using the StatelessScriptUpdateProcessorFactory[2][3]; e.g. see slides 10 and 11 of Erik Hatcher’s
>> Lucene/Solr Revolution 2013 talk[4].
>> 
>> Your script could look something like this (untested; replace "<field type>" with your field type):
>> 
>> =====
>> // Count the distinct tokens the analyzer emits for a field value.
>> function getUniqueTokenCount(analyzer, fieldName, fieldValue) {
>>   var uniqueTokens = {};
>>   var stream = analyzer.tokenStream(fieldName, fieldValue);
>>   // Under Nashorn (Java 8+) the class argument may need a trailing ".class".
>>   var termAttr = stream.getAttribute(Packages.org.apache.lucene.analysis.tokenattributes.CharTermAttribute);
>>   stream.reset();
>>   while (stream.incrementToken()) { uniqueTokens[termAttr.toString()] = 1; }
>>   stream.end();
>>   stream.close();
>>   return Object.keys(uniqueTokens).length;
>> }
>> function processAdd(cmd) {
>>   var doc = cmd.solrDoc; // the SolrInputDocument being indexed
>>   // "<field>" is a placeholder for the source field whose tokens are counted.
>>   var content = doc.getFieldValue("<field>");
>>   if (content == null) { return; }
>>   var analyzer = req.getCore().getLatestSchema().getFieldTypeByName("<field type>").getIndexAnalyzer();
>>   doc.setField("unique_token_count_i", getUniqueTokenCount(analyzer, null, content));
>> }
>> function processDelete(cmd) { }
>> function processMergeIndexes(cmd) { }
>> function processCommit(cmd) { }
>> function processRollback(cmd) { }
>> function finish() { }
>> =====
>> 
>> And your query could then look something like this (replace "<field>" with your field name)[5][6][7]:
>> 
>> =====
>> fq={!frange l=0 h=0}sub(unique_token_count_i,sum(termfreq(<field>,'CENTURY'),termfreq(<field>,'BANCORP'),termfreq(<field>,'INC')))
>> =====
>> 
>> Note that to construct the query above you’ll need to tokenize and deduplicate the terms
>> on the client side. If tokenization is non-trivial, you could use Solr's Field Analysis API[8]
>> to perform tokenization for you.
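
A minimal sketch of that client-side step, assuming a simple tokenizer that splits on anything other than ASCII letters and digits (an approximation, not Solr's analysis chain). The function names `tokenizeSimple`, `uniqueTerms` and `buildTokenCountFilter` are hypothetical; `unique_token_count_i` is the count field written by the script above.

```javascript
// Hypothetical tokenizer: split on runs of non-alphanumeric ASCII characters.
// Case is left untouched; adjust this to mirror your real analysis chain.
function tokenizeSimple(text) {
  return text.split(/[^A-Za-z0-9]+/).filter(function (t) {
    return t.length > 0;
  });
}

// Deduplicate tokens while keeping first-seen order.
function uniqueTerms(text) {
  var seen = {};
  var result = [];
  tokenizeSimple(text).forEach(function (t) {
    if (!Object.prototype.hasOwnProperty.call(seen, t)) {
      seen[t] = true;
      result.push(t);
    }
  });
  return result;
}

// Build the frange filter: stored unique-token count minus the summed
// term frequencies of the query terms must equal zero.
function buildTokenCountFilter(field, countField, text) {
  var terms = uniqueTerms(text);
  if (terms.length === 0) {
    throw new Error("no tokens found in input");
  }
  var freqs = terms.map(function (t) {
    // Tokens are alphanumeric only, so no quote escaping is needed here.
    return "termfreq(" + field + ",'" + t + "')";
  });
  return "{!frange l=0 h=0}sub(" + countField + ",sum(" + freqs.join(",") + "))";
}
```

Calling `buildTokenCountFilter("company_name", "unique_token_count_i", "CENTURY BANCORP, INC.")` produces a filter of the same shape as the query above, with `company_name` in place of `<field>`. The string still has to be URL-encoded before it is sent as an `fq` parameter.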
>> 
>> [1] https://issues.apache.org/jira/browse/SOLR-12673 
>> [2] https://wiki.apache.org/solr/ScriptUpdateProcessor
>> [3] https://lucene.apache.org/solr/7_4_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html
>> [4] https://www.slideshare.net/erikhatcher/solr-indexing-and-analysis-tricks
>> [5] https://lucene.apache.org/solr/guide/7_4/other-parsers.html#OtherParsers-FunctionRangeQueryParser
>> [6] https://lucene.apache.org/solr/guide/7_4/function-queries.html#termfreq-function
>> [7] https://lucene.apache.org/solr/guide/7_4/function-queries.html#sub-function
>> [8] https://lucene.apache.org/solr/guide/7_4/implicit-requesthandlers.html#analysis-handlers
>> 
>> --
>> Steve
>> www.lucidworks.com
>> 
>>> On Sep 21, 2018, at 10:45 AM, Erick Erickson <erickerickson@gmail.com> wrote:
>>> 
>>> A variant on Alexandre's approach is:
>>> - At index time, count the tokens that will be produced yourself (this
>>>   may be a little tricky; you shouldn't have WordDelimiterFilterFactory
>>>   in your analysis, for instance).
>>> - Put the number of tokens in a separate field.
>>> - At query time, search
>>>   q=+company_name:(+century +bancorp +inc) +tokens_in_company_name_field:3
>>> 
>>> You don't need phrase queries with this approach; order doesn't matter.
>>> 
>>> It can get tricky though, should "CENTURY BANCORP, INC." and "CENTURY
>>> BANCORP, INCORPORATED." match?
>>> 
>>> Again, though, this means your indexing code has to do the same thing
>>> as your analysis chain. Which isn't very hard if the analysis chain is
>>> simple. I might use a char _filter_ factory to remove all
>>> non-alphanumeric characters, then a whitespace tokenizer and
>>> (probably) a lowercasefilter. That's pretty easy to replicate in order
>>> to count tokens.
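
Replicating such a chain on the client could look roughly like the sketch below. It assumes ASCII-only names and the chain described above (remove non-alphanumeric characters, split on whitespace, lowercase). `analyzeName` and `buildCountedQuery` are hypothetical names; the field names are the ones from the example query.

```javascript
// Approximate the described chain: strip punctuation, split on whitespace,
// lowercase. Non-ASCII letters would be dropped by this regex (an assumption).
function analyzeName(text) {
  return text
    .replace(/[^A-Za-z0-9\s]/g, "")
    .split(/\s+/)
    .filter(function (t) { return t.length > 0; })
    .map(function (t) { return t.toLowerCase(); });
}

// Build a query that requires every token plus an exact token count.
function buildCountedQuery(nameField, countField, text) {
  var tokens = analyzeName(text);
  if (tokens.length === 0) {
    throw new Error("no tokens found in input");
  }
  var clauses = tokens.map(function (t) { return "+" + t; }).join(" ");
  return "+" + nameField + ":(" + clauses + ") +" + countField + ":" + tokens.length;
}
```

The same `analyzeName(...).length` value would be written to the count field at index time, so that indexing and querying agree on what counts as a token.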
>>> 
>>> Best,
>>> Erick
>>> On Fri, Sep 21, 2018 at 7:18 AM Alexandre Rafalovitch
>>> <arafalov@gmail.com> wrote:
>>>> 
>>>> I think you can match everything in the query to the field using either
>>>> 1) disMax/eDisMax with mm=100%
>>>> https://lucene.apache.org/solr/guide/7_4/the-dismax-query-parser.html#mm-minimum-should-match-parameter
>>>> 2) Complex Phrase Query Parser with inOrder=false:
>>>> https://lucene.apache.org/solr/guide/7_4/other-parsers.html#complex-phrase-query-parser
>>>> 
>>>> The number of tokens though is hard. You only know what your tokens
>>>> are at the end of the indexing pipeline. And during search, the tokens
>>>> are looked up from their indexes and only then the documents are
>>>> looked up.
>>>> 
>>>> You may be able to do this with a custom PostFilter that would run after
>>>> everything else and simply reject records with extra tokens. That would
>>>> not be too expensive.
>>>> 
>>>> Or (possibly a simpler way) you could try to precalculate things by
>>>> writing a custom TokenFilter that takes a stream and returns the token
>>>> count, to be used as a copyField target. Then you send your query to
>>>> the same field with any syntax that preserves the full query, either as a
>>>> phrase or via the field query parser:
>>>> https://lucene.apache.org/solr/guide/7_4/other-parsers.html#complex-phrase-query-parser
>>>> 
>>>> I would love to know if any/all of this works for you.
>>>> 
>>>> Regards,
>>>> Alex.
>>>> 
>>>> On 21 September 2018 at 09:00, marotosg <marotosg@gmail.com> wrote:
>>>>> Hi,
>>>>> 
>>>>> I have to search for company names where my first requirement is to find
>>>>> only exact matches on the company name.
>>>>> 
>>>>> For instance, if I search for "CENTURY BANCORP, INC." I shouldn't find
>>>>> "NEW CENTURY BANCORP, INC."
>>>>> because the result company has the extra keyword "NEW".
>>>>> 
>>>>> I can't use exact match because the sequence of tokens may differ. Basically
>>>>> I need to find results where the tokens are the same in any order and
>>>>> the number of tokens matches.
>>>>> 
>>>>> I have no idea whether it's possible to include the number of tokens in
>>>>> the query and have a Solr field that holds that information to match against.
>>>>> 
>>>>> Thanks for your help
>>>>> Sergio
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>> 
> 

