lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stamatis Zampetakis <zabe...@gmail.com>
Subject Re: Semantics and performance regarding min number of the optional BooleanClauses
Date Mon, 30 Mar 2020 22:20:25 GMT
Thanks Adrien, indeed I missed the difference in the score since in my
context I do not need them most of the time.

I will try to check if there is something that can be done with respect to
the rewriting and if something promising comes up I will let you know.

Best,
Stamatis


On Mon, Mar 30, 2020, 7:20 PM Adrien Grand <jpountz@gmail.com> wrote:

> Hi Stamatis,
>
> One thing that you missed regarding semantics is scoring. While (A B C)~2
> and ((+A +B) (+A +C) (+B +C)) would match the same documents, they would
> produce different scores.
>
> Moreover, many users come to this query because it is exactly what they
> need: matching k out of n clauses. In the example you gave it's pretty
> simple because there are only 3 clauses, but try to see what the generated
> query looks like when matching 3 out of 5 clauses, it's already very
> complex.
>
> It would be nice if we could rewrite the expanded form into the variant
> that sets a minimum number of matching should clauses, which should be more
> efficient. My worry is that it would be quite expensive to do, maybe to the
> point that it would more hurt than help on average. I'd be very happy to be
> proven wrong though, if we can cheaply rewrite the expanded form, this
> would be a good addition.
>
> On Mon, Mar 30, 2020 at 6:06 PM Stamatis Zampetakis <zabetak@gmail.com>
> wrote:
>
> > Hi all,
> >
> > My question concerns the method setMinimumNumberShouldMatch in
> BooleaQuery
> > class.
> >
> > Lets assume that we have 3 queries (optional clauses), namely A, B, C and
> > we build a BooleanQuery specifying that at least 2 should match.
> >
> > In terms of semantics what I understand so far is that
> >
> > (A B C)~2 is equivalent to ((+A +B) (+A +C) (+B +C)).
> >
> > In other words a single BooleaQuery with a min should match parameter
> could
> > be rewritten as pure disjunctive BooleanQuery comprised from 3
> sub-queries.
> >
> > In terms of performance it seems that the two queries present different
> > behavior so the minMatch property is not only syntactic sugar and
> > apparently there is no rewriting between the two.
> >
> > Coming from the SQL world it is a bit hard for me to justify the addition
> > of a new operator that looks like syntactic sugar and at the same time is
> > more performant than the more primitive equivalents. I looked a bit in
> [1]
> > to understand motivation for adding this API but without much success.
> >
> > Summing up everything to three questions:
> > 1. Did I get right the semantics of this extra property or there are
> things
> > that I am missing?
> > (If my understanding is correct)
> > 2. What's the reason for introducing the minMatch property in the first
> > place? (Avoid creating huge queries?)
> > 3. Should the performance of the two queries shown above differ?
> >
> > Thanks in advance!
> >
> > Best,
> > Stamatis
> >
> > [1] https://issues.apache.org/jira/browse/LUCENE-395
> >
>
>
> --
> Adrien
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message