lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <>
Subject Re: About counting term hits
Date Sun, 16 Nov 2008 01:35:57 GMT

: I think to do this efficiently you'd need to modify Lucene's builtin query
: classes (eg TermQuery) such that during the scoring process, in addition to
: simply computing its contribution to the document's score, it would also
: record further information like total number of occurrences of each term,
: which docs had which terms, etc.

Unless you don't care about the score at all, just the total count.  In 
which case you can use a custom Similarity to make the score for a doc be 
the count (ignore idf, norms, queryNorm, etc...)  Then use a hit collector 
that sums the counts for every doc matched.  

that should be as efficient as possible (it's certianly only one pass) but 
you might be able to optimize it by using your other criteria 
(date range or whatever) in a Filter to generate a BitSet, then fetch a 
TermDocs instance for your term, and iterate through the docs summing up 
the frequencies (you can use skipTo(set.nextSetBit()) to optimize away 
non-matching docs)

(at least i'm pretty sure that would work)

Another nuance to this question...

: > > > > I am new to LUCENE and I am testing some issues about it. I can
: > > > > retrieve
: > > > > the number of documents which satisfies a query, but I don't find how
: > > > > to
: > > > > obtain the number of terms which match it.

The words "term" and "query" mean very specific, and independent, things 
in lucene, ... but Mario seems to be using them interchangably -- if you 
want to know how often a Term filtered appears in all docs matching some 
criteria, then all of the techniques described so far should work.

but if you want to count the occurances of a more complicated Query (like: 
how many times does the phrase "Mario Barcala" appear in docs from 
199-2003) the situation gets more complicated ... for that you would want 
ot use something like a SpanQuery and iterate through the Spans (counting 
them as you go)


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message