lucene-solr-user mailing list archives

From John Bickerstaff <j...@johnbickerstaff.com>
Subject Re: Want zero results from SOLR when there are no matches for "querystring"
Date Fri, 12 Aug 2016 18:09:31 GMT
Thanks!  I'll check it out.

On Fri, Aug 12, 2016 at 12:05 PM, Susheel Kumar <susheel2777@gmail.com>
wrote:

> Not exactly sure what you are looking for from chaining the results, but
> similar functionality is available in Streaming Expressions, where the
> results of inner expressions are passed to outer expressions, and so on:
> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
>
> HTH
> Susheel
>
> On Fri, Aug 12, 2016 at 1:08 PM, John Bickerstaff <john@johnbickerstaff.com>
> wrote:
>
> > Hossman - many thanks again for your comprehensive and very helpful
> > answer!
> >
> > All,
> >
> > I am (possibly mis-)remembering reading something about being able to
> > pass the results of one query to another query...  Essentially "chaining"
> > result sets.
> >
> > I have looked in docs and can't find anything on a quick search -- I may
> > have been reading about the Re-Ranking feature, which doesn't help me (I
> > know because I just tried and it seems to return all results anyway, just
> > re-ranking the number specified in the reRankDocs flag...)
> >
> > Is there a way to (cleanly) send the results of one query to another
> > query for further processing?  Essentially, pass ONLY the results
> > (including an empty set of results) to another query for processing?
> >
> > thanks...
> >
> > On Thu, Aug 11, 2016 at 6:19 PM, John Bickerstaff <john@johnbickerstaff.com>
> > wrote:
> >
> > > Thanks!
> > >
> > > To answer your questions, while I digest the rest of that
> > > information...
> > >
> > > I'm using the hon-lucene-synonyms.5.0.4.jar from here:
> > > https://github.com/healthonnet/hon-lucene-synonyms
> > >
> > > The config looks like this - and IIRC, is simply a copy from the
> > > recommended config on the site mentioned above.
> > >
> > >  <queryParser name="synonym_edismax" class="com.github.healthonnet.search.SynonymExpandingExtendedDismaxQParserPlugin">
> > >     <!-- You can define more than one synonym analyzer in the following list.
> > >          For example, you might have one set of synonyms for English, one for French,
> > >          one for Spanish, etc.
> > >       -->
> > >     <lst name="synonymAnalyzers">
> > >       <!-- Name your analyzer something useful, e.g. "analyzer_en", "analyzer_fr", "analyzer_es", etc.
> > >            If you only have one, the name doesn't matter (hence "myCoolAnalyzer").
> > >         -->
> > >       <lst name="myCoolAnalyzer">
> > >         <!-- We recommend a PatternTokenizerFactory that tokenizes based on whitespace and quotes.
> > >              This seems to work best with most people's synonym files.
> > >              For details, read the discussion here:
> > >              http://github.com/healthonnet/hon-lucene-synonyms/issues/26
> > >           -->
> > >         <lst name="tokenizer">
> > >           <str name="class">solr.PatternTokenizerFactory</str>
> > >           <str name="pattern"><![CDATA[(?:\s|\")+]]></str>
> > >         </lst>
> > >         <!-- The ShingleFilterFactory outputs synonyms of multiple token lengths (e.g. unigrams, bigrams, trigrams, etc.).
> > >              The default here is to assume you don't have any synonyms longer than 4 tokens.
> > >              You can tweak this depending on what your synonyms look like.
> > >              E.g. if you only have unigrams, you can remove
> > >              it entirely, and if your synonyms are up to 7 tokens in length, you should set the maxShingleSize to 7.
> > >           -->
> > >         <lst name="filter">
> > >           <str name="class">solr.ShingleFilterFactory</str>
> > >           <str name="outputUnigramsIfNoShingles">true</str>
> > >           <str name="outputUnigrams">true</str>
> > >           <str name="minShingleSize">2</str>
> > >           <str name="maxShingleSize">4</str>
> > >         </lst>
> > >         <!-- This is where you set your synonym file.  For the unit tests and "Getting Started" examples, we use example_synonym_file.txt.
> > >              This plugin will work best if you keep expand set to true and
> > >              have all your synonyms comma-separated (rather than =>-separated).
> > >           -->
> > >         <lst name="filter">
> > >           <str name="class">solr.SynonymFilterFactory</str>
> > >           <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
> > >           <str name="synonyms">example_synonym_file.txt</str>
> > >           <str name="expand">true</str>
> > >           <str name="ignoreCase">true</str>
> > >         </lst>
> > >       </lst>
> > >     </lst>
> > >   </queryParser>
> > >
> > >
> > >
> > > On Thu, Aug 11, 2016 at 6:01 PM, Chris Hostetter <hossman_lucene@fucit.org>
> > > wrote:
> > >
> > >>
> > >> : First let me say that this is very possibly the "x - y problem" so let me
> > >> : state up front what my ultimate need is -- then I'll ask about the thing I
> > >> : imagine might help...  which, of course, is heavily biased in the direction
> > >> : of my experience coding Java and writing SQL...
> > >>
> > >> Thank you so much for asking your question this way!
> > >>
> > >> Right off the bat, the background you've provided seems suspicious...
> > >>
> > >> : I have a piece of a query that calculates a score based on a "weighting"
> > >>         ...
> > >> : The specific line is this:
> > >> : <str name="bf">product(field(category_weight),20)</str>
> > >> :
> > >> : What I just realized is that when I query Solr for a string that has NO
> > >> : matches in the entire corpus, I still get a slew of results because EVERY
> > >> : doc has the weighting value in the category_weight field - and therefore
> > >> : every doc gets some score.
> > >>
> > >> ...that is *NOT* how dismax and edismax normally work.
> > >>
> > >> While both the "bf" and "bq" params result in "additive" boosting, and the
> > >> implementation of that "additive boost" comes from adding new optional
> > >> clauses to the top level BooleanQuery that is executed, that only happens
> > >> after the "main" query (from your "q" param) is added to that top level
> > >> BooleanQuery as a "mandatory" clause.
> > >>
> > >> So, for example, "bf=true()" and "bq=*:*" should match & boost every doc,
> > >> but with the techproducts configs/data these requests still don't match
> > >> anything...
> > >>
> > >> /select?defType=edismax&q=bogus&bf=true()&bq=*:*&debug=query
> > >> /select?defType=dismax&q=bogus&bf=true()&bq=*:*&debug=query
> > >>
> > >> ...and if you look at the debug output, the parsed query shows that the
> > >> "bogus" part of the query is mandatory...
> > >>
> > >> +DisjunctionMaxQuery((text:bogus)) MatchAllDocsQuery(*:*)
> > >> FunctionQuery(const(true))
> > >>
> > >> (i didn't use "pf" in that example, but the effect is the same, the "pf"
> > >> based clauses are optional, while the "qf" based clauses are mandatory)
> > >>
> > >> If you compare that example to your debug output, you'll notice a
> > >> difference in structure -- it's a bit hard to see in your example, but if
> > >> you simplify your qf, pf, and q fields it should be more obvious, but
> > >> AFAICT the "main" parts of your query are getting wrapped in an extra
> > >> layer of parens (ie: an extra BooleanQuery) which is *not* mandatory in
> > >> the top level query ... i don't see *any* mandatory clauses in your top
> > >> level BooleanQuery, which is why any match on a bf or bq function is
> > >> enough to cause a document to match.
> > >>
> > >> I suspect the reason your parsed query structure is so diff has to do with
> > >> this...
> > >>
> > >> :        <str name="defType">synonym_edismax</str>>
> > >>
> > >>
> > >> 1) how exactly is "synonym_edismax" defined in your solrconfig.xml?
> > >> 2) what QParserPlugin are you using to implement that?
> > >>
> > >> I suspect whatever QParserPlugin you are using has a bug in it :)
> > >>
> > >>
> > >> If you can't fix the bug, one possible workaround would be to abandon the bf
> > >> and bq params completely, and instead wrap the query it produces in a
> > >> {!boost} parser with whatever function you want (using functions like
> > >> sum() or product() to combine multiple functions, and query() to incorporate
> > >> your current bq param).  Doing this will require changing how you specify
> > >> your input (example below) and it will result in *multiplicative* boosts --
> > >> so your scores will be much diff, and you will likely have to adjust your
> > >> constants, but: 1) multiplicative boosts are almost always what people
> > >> *really* want anyway; 2) it will ensure the boosts are only applied for
> > >> things matching your main query, no matter how that query parser works or
> > >> what bugs it has.
> > >>
> > >> Example of using {!boost} to wrap an arbitrary other parser...
> > >>
> > >> instead of...
> > >>   defType=foofoo
> > >>   q=barbarbar
> > >>
> > >> use...
> > >>    q={!boost b=$func defType=foofoo v=$qq}
> > >>   qq=barbarbar
> > >> func=sum(something,somethingelse)
> > >>
> > >> https://cwiki.apache.org/confluence/display/solr/Other+Parsers
> > >> https://cwiki.apache.org/confluence/display/solr/Function+Queries
> > >>
> > >>
> > >>
> > >>
> > >> :
> > >> : What I would like is to return zero results if there is no match for the
> > >> : querystring.  My collection is small enough that I don't care if the actual
> > >> : calculation runs on each doc (although that's wasteful) -- I just don't
> > >> : want to see results come back for zero matches to the querystring.
> > >> :
> > >> : (The /select endpoint does this of course, but my custom endpoint includes
> > >> : this "weighting" piece and therefore returns every doc in the corpus
> > >> : because they all have the weighting.)
> > >> :
> > >> : ====================
> > >> : Enter my imagined solution...  The potential X-Y problem...
> > >> : ====================
> > >> :
> > >> : So - given that I come from a programming background, I immediately start
> > >> : thinking of an if statement ...
> > >> :
> > >> :      if(some_score_for_the_primary_search_string) {
> > >> :           run_the_category_weight_calculation;
> > >> :      } else {
> > >> :           do_NOT_run_category_weight_calc;
> > >> :      }
> > >> :
> > >> :
> > >> : Another way of thinking of it would be something like the "WHERE" clause in
> > >> : SQL...
> > >> :
> > >> :  run_category_weight_calculation WHERE "searchstring" is found in the
> > >> : document, not otherwise.
> > >> :
> > >> : I'm aware that things could be handled in the client-side of my web app,
> > >> : but if possible, I'd like the interface to SOLR to be as clean as possible,
> > >> : and massage incoming SOLR data as little as possible.
> > >> :
> > >> : In other words, do NOT return any docs if the querystring (and any
> > >> : synonyms) match zero docs.
> > >> :
> > >> : Here is the endpoint XML for the query.  I've highlighted the specific line
> > >> : that is causing the unintended results...
> > >> :
> > >> :
> > >> :  <requestHandler name="/foo" class="solr.SearchHandler">
> > >> :     <!-- default values for query parameters can be specified, these
> > >> :          will be overridden by parameters in the request
> > >> :       -->
> > >> :      <lst name="defaults">
> > >> :        <str name="echoParams">all</str>
> > >> :        <int name="rows">20</int>
> > >> :        <!-- Query settings -->
> > >> :        <str name="df">text</str>
> > >> :       <!-- <str name="df">title</str> -->
> > >> :        <str name="defType">synonym_edismax</str>>
> > >> :        <str name="synonyms">true</str>
> > >> :     <!-- The line below balances out the weighting of exact matches to the
> > >> :          synonym phrase entered by the user
> > >> :          with the category_weight calculation and the titleQuery calc.
> > >> :          These numbers exist in a balance and
> > >> :          if one is raised or lowered, the others (probably) need to change
> > >> :          as well.  It may be better to go with decimals
> > >> :          for all of them... .4 instead of 4 and 2 instead of 20 and 2.5
> > >> :          instead of 25.
> > >> :          In the end, I'm not sure it really matters, but don't change one
> > >> :          without changing the others
> > >> :          unless you've tested and are sure you want the results  -->
> > >> :        <float name="synonyms.originalBoost">1.5</float>
> > >> :        <float name="synonyms.synonymBoost">1.1</float>
> > >> :        <str name="mm">75%</str>
> > >> :        <str name="q.alt">*:*</str>
> > >> :        <str name="rows">20</str>
> > >> :        <str name="fq">meta_doc_type:chapterDoc</str>
> > >> :        <str name="bq">{!synonym_edismax qf='title' synonyms='true' synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf='' bq='' v=$q}</str>
> > >> :        <str name="fl">id category_weight title category_ss score contentType</str>
> > >> :        <str name="titleQuery">{!edismax qf='title' bf='' bq='' v=$q}</str>
> > >> : =====================================================
> > >> :        *<str name="bf">product(field(category_weight),20)</str>*
> > >> : =====================================================
> > >> :        <str name="bf">product(query($titleQuery),4)</str>
> > >> :        <str name="qf">text contentType^1000</str>
> > >> :        <str name="wt">python</str>
> > >> :        <str name="debug">true</str>
> > >> :        <str name="debug.explain.structured">true</str>
> > >> :        <str name="indent">true</str>
> > >> :        <str name="echoParams">all</str>
> > >> :      </lst>
> > >> :   </requestHandler>
> > >> :
> > >> : And here is the debug output for a query.  (This was a test for synonyms,
> > >> : which you'll see in the output.) The original query string was, of
> > >> : course, "μ-heavy chain disease"
> > >> :
> > >> : You'll note that although there is no score in the first doc explain for
> > >> : the actual querystring, the highlighted section does get a score for
> > >> : product(double(category_weight)=1.5,const(20))
> > >> :
> > >> : ... which is the thing that is currently causing all the docs in the
> > >> : collection to "match" even though the querystring is not in any of them.
> > >> :
> > >> : "debug":{ "rawquerystring":"\"μ-heavy chain disease\"",
> > >> : "querystring":"\"μ-heavy
> > >> : chain disease\"", "parsedquery":"(DisjunctionMaxQuery((text:\"μ
> heavy
> > >> chain
> > >> : disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5
> > >> : ((+DisjunctionMaxQuery((text:\"mu heavy chain disease\" |
> > >> (contentType:\"mu
> > >> : heavy chain disease\")^1000.0)))/no_coord^1.1)
> > >> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
> > >> : hcd\")^1000.0)))/no_coord^1.1) ((+DisjunctionMaxQuery((text:\"μ
> heavy
> > >> chain
> > >> : disease\" | (contentType:\"μ heavy chain
> > disease\")^1000.0)))/no_coord^
> > >> 1.1)
> > >> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
> > >> : hcd\")^1000.0)))/no_coord^1.1)) ((DisjunctionMaxQuery((title:\"μ
> > heavy
> > >> : chain disease\"))^2.5 ((+DisjunctionMaxQuery((title:\"mu heavy
> chain
> > >> : disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
> > >> : hcd\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ heavy
> chain
> > >> : disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
> > >> : hcd\")))/no_coord^1.1)))
> > >> : FunctionQuery(product(double(category_weight),const(20)))
> > >> : FunctionQuery(product(query(+(title:\"μ heavy chain
> > >> : disease\"),def=0.0),const(4)))", "parsedquery_toString":"(((
> text:\"μ
> > >> heavy
> > >> : chain disease\" | (contentType:\"μ heavy chain
> disease\")^1000.0))^1.5
> > >> : ((+(text:\"mu heavy chain disease\" | (contentType:\"mu heavy chain
> > >> : disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" | (contentType:\"μ
> > >> : hcd\")^1000.0))^1.1) ((+(text:\"μ heavy chain disease\" |
> > >> (contentType:\"μ
> > >> : heavy chain disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" |
> > >> (contentType:\"μ
> > >> : hcd\")^1000.0))^1.1)) ((((title:\"μ heavy chain disease\"))^2.5
> > >> : ((+(title:\"mu heavy chain disease\"))^1.1) ((+(title:\"μ
> hcd\"))^1.1)
> > >> : ((+(title:\"μ heavy chain disease\"))^1.1) ((+(title:\"μ
> > hcd\"))^1.1)))
> > >> : product(double(category_weight),const(20))
> product(query(+(title:\"μ
> > >> heavy
> > >> : chain disease\"),def=0.0),const(4))", "explain":{ "
> > >> : 33d808fe-6ccf-4305-a643-48e94de34d18":{ "match":true,
> "value":30.0, "
> > >> : description":"sum of:", "details":[{ "match":true, "value":30.0, "
> > >> : description":"FunctionQuery(product(double(category_weight),
> > >> const(20))),
> > >> : product of:",
> > >> : =====================================================
> > >> : *"details":**[{ "match":true, "value":30.0,
> > >> : "description":"product(double(category_weight)=1.5,const(20))"}, {*
> > >> : =====================================================
> > >> :
> > >> : "match":true, "value":1.0, "description":"boost"}, { "match":true,
> > >> "value":
> > >> : 1.0, "description":"queryNorm"}]}, {
> > >> :
> > >>
> > >> -Hoss
> > >> http://www.lucidworks.com/
> > >
> > >
> > >
> >
>
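For anyone who wants to try the {!boost} workaround quoted above from a client, here is a minimal sketch of how the request parameters might be assembled. It follows the q / qq / func pattern from Hoss's example; the helper name build_boost_params, the choice of Python, and the use of the /foo handler with the category_weight function as the boost are illustrative assumptions, not something confirmed in the thread.

```python
from urllib.parse import urlencode


def build_boost_params(user_query,
                       inner_parser="synonym_edismax",
                       boost_func="product(field(category_weight),20)"):
    """Build request params that wrap the main query in {!boost}.

    Hypothetical helper: the parameter names q, qq and func mirror the
    example in the quoted message; the defaults are assumptions taken
    from the handler config shown in the thread.
    """
    return {
        # Wrapper query: the score of the inner query is multiplied by $func,
        # so documents that do not match the inner query are never returned.
        "q": "{!boost b=$func defType=%s v=$qq}" % inner_parser,
        # The user's actual search string, dereferenced via v=$qq.
        "qq": user_query,
        # The boost function, dereferenced via b=$func.
        "func": boost_func,
    }


params = build_boost_params('"mu heavy chain disease"')
query_string = urlencode(params)  # percent-encodes {, !, $, = and quotes
print("/foo?" + query_string)
```

This only makes sense if the bf and bq defaults are removed from the request handler, as the quoted suggestion says. Because the boost is multiplicative rather than additive, the constant 20 would probably need retuning.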
