lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex K <aklib...@gmail.com>
Subject Re: Optimizing term-occurrence counting (code included)
Date Fri, 24 Jul 2020 11:59:18 GMT
Thanks Ali. I don't think that will work in this case, since the data I'm
counting is managed by lucene, but that looks like an interesting project.
-Alex

On Fri, Jul 24, 2020, 00:15 Ali Akhtar <ali@ali.actor> wrote:

> I'm new to lucene so I'm not sure what the best way of speeding this up in
> Lucene is, but I've previously used https://github.com/npgall/cqengine for
> similar stuff. It provided really good performance, especially if you're
> just counting things.
>
> On Fri, Jul 24, 2020 at 6:55 AM Alex K <aklibisz@gmail.com> wrote:
>
> > Hi all,
> >
> > I am working on a query that takes a set of terms, finds all documents
> > containing at least one of those terms, computes a subset of candidate
> docs
> > with the most matching terms, and applies a user-provided scoring
> function
> > to each of the candidate docs
> >
> > Simple example of the query:
> > - query terms ("aaa", "bbb")
> > - indexed docs with terms:
> >   docId 0 has terms ("aaa", "bbb")
> >   docId 1 has terms ("aaa", "ccc")
> > - number of top candidates = 1
> > - simple scoring function score(docId) = docId + 10
> > The query first builds a count array [2, 1], because docId 0 contains two
> > matching terms and docId 1 contains 1 matching term.
> > Then it picks docId 0 as the candidate subset.
> > Then it applies the scoring function, returning a score of 10 for docId
> 0.
> >
> > The main bottleneck right now is doing the initial counting, i.e. the
> part
> > that returns the [2, 1] array.
> >
> > I first started by using a BoolQuery containing a Should clause for every
> > Term, so the returned score was the count. This was simple but very slow.
> > Then I got a substantial speedup by copying and modifying the
> > TermInSetQuery so that it tracks the number of times each docId contains
> a
> > query term. The main construct here seems to be PrefixCodedTerms.
> >
> > At this point I'm not sure if there's any faster construct, or perhaps a
> > more optimal way to use PrefixCodedTerms?
> >
> > Here is the specific query, highlighting some specific parts of the code:
> > - Build the PrefixCodedTerms (in my case the terms are called 'hashes'):
> >
> >
> https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L27-L33
> > - Count the matching terms in a segment (this is the main bottleneck in
> my
> > query):
> >
> >
> https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73
> >
> > I appreciate any suggestions you might have.
> >
> > - Alex
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message