lucene-java-user mailing list archives

From Doron Cohen
Subject RE: Scoring a document (count?)
Date Thu, 03 Aug 2006 22:03:34 GMT
Hi Russell,

I am also interested in the internals of Lucene's ranking and in how one
can/should alter the scoring. So far I have just been learning from the
existing code of Lucene's Scorers and Weights. Your question seemed
interesting, so as an exercise I implemented a quick scorer that returns
the raw tf (term frequency) as the score. It is not a production-level
implementation, of course, but if you think it would help you, I can share
the code. (I would have responded sooner, but my work computer was down for
a few days with a fan error... :-)
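
To make "raw tf as a score" concrete, here is a minimal sketch in plain
Java. It is not the scorer mentioned above and it does not use Lucene's
Scorer or Weight classes; the class name RawTfSketch and the method rawTf
are hypothetical, and the field is assumed to be already tokenized. It only
shows the quantity being returned: the number of times the term occurs,
with no idf, length normalization, or boost applied.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: no Lucene classes are used here.
final class RawTfSketch {

    // Returns the raw term frequency: how many times `term` occurs in the
    // already-tokenized field. No idf, norm, or boost is applied.
    static int rawTf(List<String> tokens, String term) {
        int tf = 0;
        for (String token : tokens) {
            if (token.equals(term)) {
                tf++;
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "star", "met", "the", "director");
        System.out.println(rawTf(tokens, "the"));
    }
}
```

A real scorer would read the frequency from the index's postings instead
of re-scanning tokens; the loop here just stands in for that lookup.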


"Russell M. Allen" <> wrote on 31/07/2006 07:35:50:

> Thank you for the reply, Doron!  You are exactly right about the SQL
> count(*).  I need the equivalent of GROUP BY and count().
> We have considered a 'joined' index where we would have a document for
> each permutation.  We discarded it (possibly prematurely) based on the
> rapid explosion in the number of documents.  In our domain, we have
> movies as the main document type, and 5 other satellite document types
> with their own indexes: Star, Studio, Director, Series, and Category
> (genre).  With the exception of series, a movie has a many-to-many
> relationship with the other indexes.  So, with 60k movies, 20k stars, 2k
> studios, ... the document count quickly shoots through the roof.
> Also, the majority of our searching is based on a single domain type,
> such as movie.  It is only a small handful of corner cases where we want
> what amounts to a joined query.  If we merged these indexes, we would
> constantly have to 'roll up' the results into distinct instances of a
> type.  (The equivalent of an SQL 'group by')
> I find the parallels between the expressiveness of Lucene and SQL
> interesting.  I'm glad to see you compared what I was looking for to an
> sql count(*) as well.  We have a handful of indexing issues that I am
> attempting to solve/optimize, of which performing a count(*) is only
> one.  I also have the need to perform a JOIN across two indexes.  I have
> 'ideas' about how I might go about this, but for now we are fortunate
> enough to have fairly static data and half of the join is static.  As a
> result we can cache a bitset filter for the results of half the join and
> apply it to the other (dynamic) half of the join query.
> Anyway, I digress...
> I saw your second post regarding creating a scorer.  I'd like to continue
> down that path.  My main issue now is simply understanding how Lucene
> works under the covers well enough to write the TermQuery variant.
> Thanks for the help,
> Russell.
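
The cached-filter approach Russell describes for the half-static join can
be sketched roughly as follows. This is plain Java on java.util.BitSet, not
the Lucene Filter API; the class name CachedJoinFilterSketch, the method
apply, and the use of int arrays of document ids are all assumptions made
for illustration. The static half of the join is turned into a bit set
once, and each dynamic query result is intersected with it. The
cardinality of the intersection then plays the role of a count(*).

```java
import java.util.BitSet;

// Hypothetical illustration only: no Lucene classes are used here.
final class CachedJoinFilterSketch {

    // Bits set for every document id matched by the static half of the join.
    // Built once and reused, because that half of the data rarely changes.
    private final BitSet cachedStaticSide = new BitSet();

    CachedJoinFilterSketch(int[] staticMatchingDocIds) {
        for (int id : staticMatchingDocIds) {
            cachedStaticSide.set(id);
        }
    }

    // Intersects the hits of the dynamic query with the cached static side.
    // A fresh BitSet is returned so the cached bits are never modified.
    BitSet apply(int[] dynamicHits) {
        BitSet result = new BitSet();
        for (int id : dynamicHits) {
            result.set(id);
        }
        result.and(cachedStaticSide);
        return result;
    }

    public static void main(String[] args) {
        CachedJoinFilterSketch filter = new CachedJoinFilterSketch(new int[] {1, 3, 5});
        BitSet joined = filter.apply(new int[] {3, 4, 5, 6});
        System.out.println(joined + " count=" + joined.cardinality());
    }
}
```

The sketch assumes both halves refer to the same document id space; a join
across two separate indexes would first need a mapping between their ids.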
