lucene-general mailing list archives

From Eric Schulte <>
Subject Re: Filtering TermDocs and TermEnum
Date Wed, 28 Dec 2005 00:31:32 GMT
To apply statistical tools to the words

For example, say you have a large collection of news articles and you want
to know which words are appearing more often than usual today...

Then you could do a TermEnum limited to documents that were indexed today,
and then do term enums for each of the previous 10 days, to find a mean and a
standard deviation for each of the words. Using this information you could
find which word is the most standard deviations above its mean appearance
count today, and get an idea of which words are relevant to active
stories today.
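
As a rough sketch of that calculation (plain Python, not Lucene code): it
assumes the per-day term counts have already been pulled out of the index
into dictionaries, which is the part the filtered TermEnum would provide.
The function name `trending_terms` and the `min_sigma` floor are made up
for illustration; the floor is only there so that a word with a perfectly
flat history does not cause a division by zero.

```python
from statistics import mean, stdev


def trending_terms(today_counts, history_counts, min_sigma=1.0):
    """Rank terms by how many standard deviations today's count is above
    the mean of the previous days.

    today_counts:   {term: count} for documents indexed today
    history_counts: list of {term: count}, one dict per previous day
    min_sigma:      assumed floor on the standard deviation (hypothetical
                    choice) so flat histories do not divide by zero
    """
    scores = {}
    for term, today in today_counts.items():
        # A term missing from a given day simply counts as zero that day.
        past = [day.get(term, 0) for day in history_counts]
        mu = mean(past)
        # Sample standard deviation needs at least two data points.
        sigma = stdev(past) if len(past) > 1 else 0.0
        scores[term] = (today - mu) / max(sigma, min_sigma)
    # Highest score first: the words furthest above their usual level.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

With 10 days of history, `history_counts` would simply be a list of 10
dictionaries. Whether "count" means document frequency or total
occurrences is left open here, as it is in the description above.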

Or say you wanted to see which words in your corpus of news articles were
related to the word 'foo'...

You could find the frequency of every word in the index, counting only
documents that match some TermQuery (like "contents:foo"), then compare these
frequencies to the gross frequencies of every term in the index to find out
how closely each term is related to foo.
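
One possible way to do that comparison, again as plain Python rather than
Lucene code: it assumes you already have one dictionary of term counts for
the documents matching the query and another for the whole index. The
ratio of relative frequencies used here is just one reasonable choice of
measure, and the names `related_terms` and `min_count` are made up for
illustration.

```python
def related_terms(subset_counts, index_counts, min_count=1):
    """Rank terms by how over-represented they are in the matching
    documents compared with the index as a whole.

    subset_counts: {term: count} within documents matching the query
    index_counts:  {term: count} across the entire index
    min_count:     assumed cutoff (hypothetical) to ignore terms that are
                   too rare in the subset to say anything about
    """
    subset_total = sum(subset_counts.values())
    index_total = sum(index_counts.values())
    scores = {}
    for term, n in subset_counts.items():
        overall = index_counts.get(term, 0)
        if n < min_count or overall == 0:
            continue  # nothing to compare against
        # Share of the subset divided by share of the whole index;
        # values well above 1 mean the term clusters around the query.
        scores[term] = (n / subset_total) / (overall / index_total)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Very common words end up with a ratio near 1, since they appear about as
often around foo as anywhere else, so they drop to the bottom of the list
without needing a stop-word filter.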

On 12/27/05, Phoenix <> wrote:
> why ?