lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Harini Raghavan" <>
Subject RE: Counting term frequency without using Explanation
Date Mon, 19 Feb 2007 15:32:19 GMT
Hi Erick,

I have a similar requirement to know the frequency of occurrence of a
keyword in a given content to find out the relevancy of the article to a set
of keywords. If the keyword is mentioned more than once in the article, then
I want to treat it as more relevant. 

Can you please point me to the threads you have mentioned?


-----Original Message-----
From: Erick Erickson [] 
Sent: Wednesday, February 07, 2007 8:19 PM
Subject: Re: Counting term frequency without using Explanation

Before you go too far down this path, please consider what a "hit" is. It's
more complicated than you think <G>.

If all you want to do is count up the number of times any term appears in
the document, it's not too hard. You should be able to use a
termenum/termdocs process to count them.

TermDocs should work, just seek to a term, skip to the document number
(which you'll have to get somewhere else), and keep adding to your count
while the docid is the same as your target. Repeat for each term.

But it's a much more complicated story if you want to accurately reflect a
query. For instance, consider a near query, that is terms within, say, 3 of
each other. If you do something like the above, you'll present "hits" that
aren't real. For instance...

a b c d e f g h i j a

if you search for a and c within 3 of each other, is this one hit? two? it
definitely isn't three which is what you'd get if you just counted the
occurrence of the terms a, b... What about a NOT clause? How does a phrase
query get counted?

There have been several discussions of various aspects of this issue, but
often in the context of highlighting. You'll probably get some good
information from the following threads...

Counting terms' hits from phrases
Counting hits in a document

as well as searching the archive on highlighting and/or hitcount


On 2/7/07, csahat <> wrote:
> Hi all,
>   I'm so sorry if this question already answered before in this list, 
> but I already search the list, and I couldn't find the answer.
>    This is what I want to do :
>   When the user type in the query, for example "WebSphere Java", 
> Lucene will show not only the score, but showing the term count per 
> document as well, like this
>   doc1    0.8333          websphere=3, Java = 2
>   doc2    0.817            websphere=2, Java=2
>   I already tried to implement with TermFreqVector, but TermFreqVector 
> show all the terms in the field, instead what I want is only the terms 
> that happen in the query.
> I already tried using TermDocs as well, but it always gave result 0.
>   I tried using Explanation class, using toString method, but I have 
> to "clean"
> the information.
>   Is there any "direct" way to do this in Lucene ?  Or perhaps someone 
> can give me a hint ?
>   Thanks in advance

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message