lucene-java-user mailing list archives

From Kristofer Karlsson <k...@spotify.com>
Subject Re: Lucene handling of duplicate terms
Date Thu, 05 Sep 2013 14:08:12 GMT
On Thu, Sep 5, 2013 at 3:40 PM, Toke Eskildsen <te@statsbiblioteket.dk> wrote:

> On Thu, 2013-09-05 at 09:28 +0200, Kristofer Karlsson wrote:
> > For example, I may have a million documents with just the term "foo" in
> > field A, and one particular document with the term "foo" in both field A
> > and B, or with two "foo" terms in the same field.
> >
> > If I search for "foo foo" I would like to filter out all the documents
> > with only one matching term - is this possible?
>
> A bit of creative querying should do it:
>
> For the "only one foo-field"-case, you could do
>   (A:foo NOT B:foo) OR (B:foo NOT A:foo)
>
> To avoid two foo's in the same field, you could do
>   NOT field:"foo foo"~1000
>
> Combining those we get
>   ((A:foo NOT B:foo) OR (B:foo NOT A:foo)) NOT A:"foo foo"~1000 NOT B:"foo foo"~1000
>
>
> Or did I misunderstand? Do you want to keep the documents that have at
> least two foo's and discard the ones that only have one? That is simpler:
>   (A:foo AND B:foo) OR A:"foo foo"~1000 OR B:"foo foo"~1000
>
>
> This all works under the assumption that you have fewer than 1000 terms
> in each instance of your fields. Adjust the slop value accordingly.
>
> - Toke Eskildsen, State and University Library, Denmark

Yes, I meant the latter part - getting rid of hits that don't actually
have as many occurrences of the term as the search query.
The query generation sort of works if I just have two fields. For more
fields and more search terms it quickly gets more complicated - it would be
a combinatorial explosion.
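
To make the growth concrete, here is a rough sketch of what generating
Toke's "at least two foo's" query for an arbitrary list of fields could look
like. It only builds the query string and does not call Lucene at all; the
class and method names (DuplicateTermQuery, atLeastTwo, clauseCount) are
made up for illustration, and it assumes the term contains no characters
that need escaping in the query syntax.

```java
import java.util.ArrayList;
import java.util.List;

public class DuplicateTermQuery {

    /**
     * Builds a query string that matches documents with at least two
     * occurrences of `term` across `fields`. Hypothetical helper: plain
     * string concatenation, no escaping of special query characters.
     */
    public static String atLeastTwo(List<String> fields, String term, int slop) {
        List<String> clauses = new ArrayList<>();
        // Case 1: two different fields each contain the term.
        for (int i = 0; i < fields.size(); i++) {
            for (int j = i + 1; j < fields.size(); j++) {
                clauses.add("(" + fields.get(i) + ":" + term
                        + " AND " + fields.get(j) + ":" + term + ")");
            }
        }
        // Case 2: a single field contains the term twice within `slop` positions.
        for (String field : fields) {
            clauses.add(field + ":\"" + term + " " + term + "\"~" + slop);
        }
        return String.join(" OR ", clauses);
    }

    /** Number of OR clauses produced for a given number of fields. */
    public static int clauseCount(int fieldCount) {
        return fieldCount * (fieldCount - 1) / 2 + fieldCount;
    }

    public static void main(String[] args) {
        // Two fields reproduce the query from the quoted mail.
        System.out.println(atLeastTwo(List.of("A", "B"), "foo", 1000));
        // Ten fields already need 45 pair clauses plus 10 same-field clauses.
        System.out.println(clauseCount(10));
    }
}
```

With fields A and B this gives the same three clauses as the query above.
The clause count grows quadratically with the number of fields for a single
term repeated twice, and this sketch does not even try to handle terms
repeated three or more times or several different repeated terms.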

