lucene-java-user mailing list archives

From Toke Eskildsen ...@statsbiblioteket.dk>
Subject Re: Lucene handling of duplicate terms
Date Thu, 05 Sep 2013 13:40:52 GMT
On Thu, 2013-09-05 at 09:28 +0200, Kristofer Karlsson wrote:
> For an example, I may have a million documents with just the term "foo" in
> field A, and one particular document with the term "foo" in both field A
> and B, or have two terms "foo" in the same field.
> 
> If I search for "foo foo" I would like to filter out all the documents with
> only one matching term - is this possible?

A bit of creative querying should do it:

For the "only one foo-field"-case, you could do
  (A:foo NOT B:foo) OR (B:foo NOT A:foo)

To avoid two foo's in the same field, you could do
  NOT field:"foo foo"~1000

Combining those we get
  ((A:foo NOT B:foo) OR (B:foo NOT A:foo)) NOT A:"foo foo"~1000 NOT B:"foo foo"~1000
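
To sanity-check the boolean logic of the combined query, here is a small
Python model. It is only a sketch and does not use Lucene: documents are
plain dicts mapping the field names A and B to token lists, and the helper
name exactly_one_foo is made up for illustration. It also assumes the slop
is large enough to span the whole field, so the sloppy phrase clause reduces
to "the term occurs at least twice in that field".

```python
def exactly_one_foo(doc, term="foo"):
    """Model of: ((A:t NOT B:t) OR (B:t NOT A:t)) NOT A:"t t"~N NOT B:"t t"~N.

    Assumption: N (the slop) covers the whole field, so the phrase clause
    is modelled as "term occurs two or more times in the field".
    """
    count_a = doc.get("A", []).count(term)
    count_b = doc.get("B", []).count(term)

    # (A:foo NOT B:foo) OR (B:foo NOT A:foo): the term is in exactly one field
    one_field_only = (count_a > 0) != (count_b > 0)

    # NOT A:"foo foo"~N NOT B:"foo foo"~N: no field repeats the term
    repeated_in_field = count_a >= 2 or count_b >= 2

    return one_field_only and not repeated_in_field


docs = {
    "single": {"A": ["foo"]},
    "both_fields": {"A": ["foo"], "B": ["foo"]},
    "repeated": {"A": ["foo", "bar", "foo"]},
}
for name, doc in docs.items():
    print(name, exactly_one_foo(doc))
```

Only the "single" document passes the filter in this model; the other two
are rejected by the cross-field clause and the phrase clause respectively.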


Or did I misunderstand? Do you want to keep the documents that have at
least two foo's and discard the ones that only have one? That is simpler:
  (A:foo AND B:foo) OR A:"foo foo"~1000 OR B:"foo foo"~1000


This all works under the assumption that you have fewer than 1000 terms
in each instance of your fields. Adjust the slop accordingly.
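
To illustrate why the slop value matters, here is a rough Python model of the
"at least two foo's" query. It is a sketch, not Lucene code: the helper names
sloppy_repeat and at_least_two are hypothetical, documents are dicts of token
lists, and the sloppy phrase is approximated as "two occurrences of the term
with at most slop other positions between them".

```python
def sloppy_repeat(tokens, term, slop):
    """Approximate field:"term term"~slop on a list of tokens.

    Assumption: two occurrences at positions p < q match when
    q - p - 1 <= slop. Checking neighbouring occurrences is enough,
    because they are the closest pairs.
    """
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    return any(q - p - 1 <= slop for p, q in zip(positions, positions[1:]))


def at_least_two(doc, term="foo", slop=1000):
    """Model of: (A:t AND B:t) OR A:"t t"~slop OR B:"t t"~slop."""
    tokens_a = doc.get("A", [])
    tokens_b = doc.get("B", [])
    in_both_fields = term in tokens_a and term in tokens_b
    return (
        in_both_fields
        or sloppy_repeat(tokens_a, term, slop)
        or sloppy_repeat(tokens_b, term, slop)
    )


# A long field where the two occurrences are 1500 positions apart.
long_doc = {"A": ["foo"] + ["x"] * 1499 + ["foo"]}
print(at_least_two(long_doc, slop=1000))  # missed: the gap exceeds the slop
print(at_least_two(long_doc, slop=2000))  # found once the slop is raised
```

In this model the long document is missed with a slop of 1000 and found with
a slop of 2000, which is the reason the slop has to be at least as large as
the longest field instance.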

- Toke Eskildsen, State and University Library, Denmark




