lucene-solr-user mailing list archives

From Sergio García Maroto <marot...@gmail.com>
Subject Re: Query with exact number of tokens
Date Mon, 24 Sep 2018 13:46:08 GMT
Thanks all for your ideas. It was very useful information.

On Fri, 21 Sep 2018 at 19:04, Jan Høydahl <jan.asf@cominvent.com> wrote:

> I have made a FieldType specially for this:
> https://github.com/cominvent/exactmatch/
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > On 21 Sep 2018, at 18:14, Steve Rowe <sarowe@gmail.com> wrote:
> >
> > Link correction - wrong fragment identifier in ref #5 - should be:
> >
> > [5] https://lucene.apache.org/solr/guide/7_4/other-parsers.html#function-range-query-parser
> >
> > --
> > Steve
> > www.lucidworks.com
> >
> >> On Sep 21, 2018, at 12:04 PM, Steve Rowe <sarowe@gmail.com> wrote:
> >>
> >> Hi Sergio,
> >>
> >> Chris “Hoss” Hostetter has a solution to this kind of problem here:
> >> https://lists.apache.org/thread.html/6b0f0cb864aa55f0a9eadfd92d27d374ab8deb16e8131ed2b7234463@%3Csolr-user.lucene.apache.org%3E
> >> See also the suggestions in the comments on SOLR-12673[1], which include a
> >> version of Hoss’s solution.
> >>
> >> Hoss’s solution assumes a multivalued StrField with values counted
> >> using CountFieldValuesUpdateProcessorFactory, which doesn’t apply to you.
> >> You could instead count unique tokens in an analyzed field using the
> >> StatelessScriptUpdateProcessorFactory[2][3]; e.g. see slides 10 and 11 of Erik
> >> Hatcher’s Lucene/Solr Revolution 2013 talk[4].
> >>
> >> Your script could look something like this (untested; replace "<field type>"
> >> with your field type and "<field>" with your field name):
> >>
> >> =====
> >> // Count the distinct tokens the analyzer produces for one field value.
> >> function getUniqueTokenCount(analyzer, fieldName, fieldValue) {
> >>   var uniqueTokens = {};
> >>   var stream = analyzer.tokenStream(fieldName, fieldValue);
> >>   var termAttr = stream.getAttribute(Packages.org.apache.lucene.analysis.tokenattributes.CharTermAttribute);
> >>   stream.reset();
> >>   while (stream.incrementToken()) { uniqueTokens[termAttr.toString()] = 1; }
> >>   stream.end();
> >>   stream.close();
> >>   return Object.keys(uniqueTokens).length;
> >> }
> >> function processAdd(cmd) {
> >>   var doc = cmd.solrDoc; // the SolrInputDocument being added
> >>   var content = doc.getFieldValue("<field>"); // the text whose tokens are counted
> >>   var analyzer = req.getCore().getLatestSchema().getFieldTypeByName("<field type>").getIndexAnalyzer();
> >>   doc.setField("unique_token_count_i", getUniqueTokenCount(analyzer, null, content));
> >> }
> >> // The remaining hooks are required by the script processor but unused here.
> >> function processDelete(cmd) { }
> >> function processMergeIndexes(cmd) { }
> >> function processCommit(cmd) { }
> >> function processRollback(cmd) { }
> >> function finish() { }
> >> =====
> >>
> >> And your query could then look something like this (replace "<field>" with
> >> your field name)[5][6][7]:
> >>
> >> =====
> >> fq={!frange l=0 h=0}sub(unique_token_count_i,sum(termfreq(<field>,'CENTURY'),termfreq(<field>,'BANCORP'),termfreq(<field>,'INC')))
> >> =====
> >>
> >> Note that to construct the query above you’ll need to tokenize and
> >> deduplicate terms on the client side. If tokenization is non-trivial, you
> >> could use Solr's Field Analysis API[8] to perform tokenization for you.
> >>
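
A minimal sketch of the client-side step described above: deduplicating the terms and assembling the frange filter. The helper name `buildExactTokenFilter` and the field names passed to it are hypothetical, and the terms are assumed to already be in their indexed form (for example, as returned by the Field Analysis API) and to contain no quotes or backslashes.

```javascript
// Hypothetical helper: builds the frange filter value (the part after "fq=").
// "tokens" must already be in indexed form, e.g. taken from Solr's
// Field Analysis API; field names are placeholders.
function buildExactTokenFilter(field, countField, tokens) {
  if (tokens.length === 0) {
    throw new Error("at least one token is required");
  }
  // Deduplicate while keeping first-seen order.
  var unique = tokens.filter(function (t, i) { return tokens.indexOf(t) === i; });
  var termfreqs = unique.map(function (t) {
    // Assumption: terms are simple words; quoting/escaping is not handled here.
    if (/['\\]/.test(t)) {
      throw new Error("unsupported character in term: " + t);
    }
    return "termfreq(" + field + ",'" + t + "')";
  });
  // With a single term there is nothing to sum.
  var total = termfreqs.length === 1 ? termfreqs[0] : "sum(" + termfreqs.join(",") + ")";
  return "{!frange l=0 h=0}sub(" + countField + "," + total + ")";
}
```

For example, `buildExactTokenFilter("company_name", "unique_token_count_i", ["CENTURY", "BANCORP", "INC", "INC"])` yields a filter with one `termfreq` per distinct term, in the same shape as the query shown above.
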
> >> [1] https://issues.apache.org/jira/browse/SOLR-12673
> >> [2] https://wiki.apache.org/solr/ScriptUpdateProcessor
> >> [3] https://lucene.apache.org/solr/7_4_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html
> >> [4] https://www.slideshare.net/erikhatcher/solr-indexing-and-analysis-tricks
> >> [5] https://lucene.apache.org/solr/guide/7_4/other-parsers.html#OtherParsers-FunctionRangeQueryParser
> >> [6] https://lucene.apache.org/solr/guide/7_4/function-queries.html#termfreq-function
> >> [7] https://lucene.apache.org/solr/guide/7_4/function-queries.html#sub-function
> >> [8] https://lucene.apache.org/solr/guide/7_4/implicit-requesthandlers.html#analysis-handlers
> >>
> >> --
> >> Steve
> >> www.lucidworks.com
> >>
> >>> On Sep 21, 2018, at 10:45 AM, Erick Erickson <erickerickson@gmail.com> wrote:
> >>>
> >>> A variant on Alexandre's approach is:
> >>> at index time, count the tokens that will be produced yourself (this
> >>> may be a little tricky, you shouldn't have WordDelimiterFilterFactory
> >>> in your analysis for instance).
> >>> Put the number of tokens in a separate field
> >>> At query time, you'd search q=+company_name:(+century +bancorp +inc)
> >>> +tokens_in_company_name_field:3
> >>>
> >>> You don't need phrase queries with this approach because order doesn't matter.
> >>>
> >>> It can get tricky though, should "CENTURY BANCORP, INC." and "CENTURY
> >>> BANCORP, INCORPORATED." match?
> >>>
> >>> Again, though, this means your indexing code has to do the same thing
> >>> as your analysis chain. Which isn't very hard if the analysis chain is
> >>> simple. I might use a char _filter_ factory to remove all
> >>> non-alphanumeric characters, then a whitespace tokenizer and
> >>> (probably) a lowercasefilter. That's pretty easy to replicate in order
> >>> to count tokens.
> >>>
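
A minimal sketch of replicating such a simple analysis chain in indexing or client code, so that the token count and the query can be computed outside Solr. It assumes the chain strips non-alphanumeric characters (keeping whitespace so the tokenizer still has something to split on), tokenizes on whitespace, and lowercases; it handles ASCII only, and the function names are hypothetical. The field names follow the query in the message above.

```javascript
// Mirrors an assumed analysis chain: strip non-alphanumerics (keeping
// whitespace), split on whitespace, lowercase. ASCII-only for brevity.
function tokenize(name) {
  return name
    .replace(/[^A-Za-z0-9\s]/g, "")
    .split(/\s+/)
    .filter(function (t) { return t.length > 0; })
    .map(function (t) { return t.toLowerCase(); });
}

// Builds the value of the q parameter: every token is required, and the
// stored token count must equal the number of tokens in the search string.
function buildQuery(name) {
  var tokens = tokenize(name);
  if (tokens.length === 0) {
    throw new Error("name produced no tokens");
  }
  var required = tokens.map(function (t) { return "+" + t; }).join(" ");
  return "+company_name:(" + required + ") +tokens_in_company_name_field:" + tokens.length;
}
```

At index time the same `tokenize` function would supply the value for the count field, e.g. `tokenize("NEW CENTURY BANCORP, INC.").length` is 4, so that document would not match a three-token query.
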
> >>> Best,
> >>> Erick
> >>> On Fri, Sep 21, 2018 at 7:18 AM Alexandre Rafalovitch
> >>> <arafalov@gmail.com> wrote:
> >>>>
> >>>> I think you can match everything in the query to the field using either
> >>>> 1) disMax/eDisMax with mm=100%:
> >>>> https://lucene.apache.org/solr/guide/7_4/the-dismax-query-parser.html#mm-minimum-should-match-parameter
> >>>> 2) Complex Phrase Query Parser with inOrder=false:
> >>>> https://lucene.apache.org/solr/guide/7_4/other-parsers.html#complex-phrase-query-parser
> >>>>
> >>>> The number of tokens though is hard. You only know what your tokens
> >>>> are at the end of the indexing pipeline. And during search, the tokens
> >>>> are looked up from their indexes and only then the documents are
> >>>> looked up.
> >>>>
> >>>> You may be able to do this with a custom PostFilter that would run after
> >>>> everything else and simply reject records with extra tokens. That would
> >>>> not be too expensive.
> >>>>
> >>>> Or (possibly a simpler way) you could try to precalculate things by
> >>>> writing a custom TokenFilter that takes a stream and returns the token
> >>>> count, to be used as a copyField target. Then you send your query to
> >>>> the same field with any full-query-preserving syntax, either as a
> >>>> phrase or with a field query parser:
> >>>> https://lucene.apache.org/solr/guide/7_4/other-parsers.html#complex-phrase-query-parser
> >>>>
> >>>> I would love to know if any/all of this works for you.
> >>>>
> >>>> Regards,
> >>>> Alex.
> >>>>
> >>>> On 21 September 2018 at 09:00, marotosg <marotosg@gmail.com> wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I have to search for company names where my first requirement is to
> >>>>> find only exact matches on the company name.
> >>>>>
> >>>>> For instance, if I search for "CENTURY BANCORP, INC." I shouldn't find
> >>>>> "NEW CENTURY BANCORP, INC." because the result company has the extra
> >>>>> keyword "NEW".
> >>>>>
> >>>>> I can't use exact match because the sequence of tokens may differ.
> >>>>> Basically I need to find results where the tokens are the same in any
> >>>>> order and the number of tokens matches.
> >>>>>
> >>>>> I have no idea whether it's possible to include the number of tokens in
> >>>>> the query and have a Solr field hold that count to match against.
> >>>>>
> >>>>> Thanks for your help
> >>>>> Sergio
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
