lucene-java-user mailing list archives

From Adrien Grand <jpou...@gmail.com>
Subject Re: Lucene handling of duplicate terms
Date Thu, 05 Sep 2013 07:46:20 GMT
Hi,

On Thu, Sep 5, 2013 at 9:28 AM, Kristofer Karlsson <krka@spotify.com> wrote:
> I have a use case where some of my documents have duplicate terms in
> various fields or within the same field.
>
> For an example, I may have a million documents with just the term "foo" in
> field A, and one particular document with the term "foo" in both field A
> and B, or have two terms "foo" in the same field.
>
> If I search for "foo foo" I would like to filter out all the documents with
> only one matching term - is this possible?

I don't think any existing query does this efficiently (if someone
reads this and knows otherwise, please correct me!). However, it should
be fairly easy to implement such a query: iterate over the postings
lists of the 'foo' term in all the fields you are interested in, sum
the term frequencies per document (the fields must have been indexed
with IndexOptions.DOCS_AND_FREQS or higher), and keep only the
documents whose summed frequency is at least 2.
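
To make the idea concrete, here is a minimal, library-free sketch in
Java. It deliberately does not use the Lucene API: each postings list
is modeled as a plain map from doc ID to term frequency, and the class
and method names (SummedFreqSketch, docsWithMinTotalFreq) are made up
for illustration. A real query would walk the per-field postings
enumerations of the index instead of in-memory maps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified model of the approach; not Lucene code.
// Each map stands in for the postings list of one term in one field:
// key = doc ID, value = frequency of the term in that field of that doc.
final class SummedFreqSketch {

    // Returns, in increasing doc ID order, the documents whose term
    // frequency summed over all given fields is at least minTotalFreq.
    static List<Integer> docsWithMinTotalFreq(
            List<Map<Integer, Integer>> postingsPerField, int minTotalFreq) {
        // Accumulate the per-document sum of frequencies across fields.
        TreeMap<Integer, Integer> totals = new TreeMap<>();
        for (Map<Integer, Integer> postings : postingsPerField) {
            for (Map.Entry<Integer, Integer> e : postings.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        // Keep only the documents that reach the threshold.
        List<Integer> matches = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : totals.entrySet()) {
            if (e.getValue() >= minTotalFreq) {
                matches.add(e.getKey());
            }
        }
        return matches;
    }
}
```

With the postings of 'foo' in fields A and B passed in and a threshold
of 2, a document matches either because 'foo' occurs twice in one field
or because it occurs once in each of the two fields.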

-- 
Adrien
