lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <>
Subject Re: Counting and Categorisation
Date Thu, 08 Feb 2007 13:47:30 GMT

On Feb 8, 2007, at 8:28 AM, Kainth, Sachin wrote:

> This email is meant for Chris Hostetter and of course anyone else who
> may know about this,
> I wonder if I can ask you a question.  I have been reading of how  
> you at
> CNET have implemented categorisation and counting so that if i type
> "Kodak Easyshare" in the reviews section you not only get a big  
> list of
> all documents about this but you also get a list of categories in  
> which
> "Kodak EasyShare" appears.  So for example it will say that there are
> documents in the "Digital cameras" category which contain "Kodak
> EasyShare" and also documents in "Peripherals" with that same query.
> I'd like to do the same thing as this and I'm not sure I've fully
> understood the explainations I've read so far.  I know you have
> described using lots of bitsets to do this but I'm not too clear on  
> the
> details.
> Let me explain what I want to do.  It is very simple.  I have a Lucene
> index containing just 3 fields (I mean field in the sense that you can
> use the fieldName:searchTerm query syntax to search for the value
> searchTerm in fieldName).  The fields are "artist", "track" and  
> "album".
> What I want to do is if the user searches of the text "love" in the
> track field they get a list of all the artists who have a track with
> "love" in the title plus a list of all the albums with the word "love"
> in the title.  Along with these album and artist names I want a  
> count of
> the number of songs in each category.  If the user clicks on one of
> these categories then that result subset is returned.
> At the moment I just return the full list of artists, albums and  
> tracks
> which I want as well.  What I've described above will be in a top bar
> which will allow the user to refine their search.
> What I'm asking then is for some specific information about how I can
> perform the categorisation and counts.

There are two ways to go about this:

   1) Use Solr.

   2) If the number of unique artists and albums is reasonable  
enough, build BitSet's in memory into a Map.  When someone searches  
for "love" (and who doesn't?) get a BitSet of the matching documents  
(using a HitCollector, or QueryFilter) and intersect it with all the  
ones in the Map.  I use (but working on phasing things into a better  
fit with Solr and scalability) this same scheme on Collex at <http://> for all the facets on the right (though I do  
leverage some of Solr's goodies, my original implementation  
successfully used the BitSet-Map-in-memory scheme (now I use  
TermQuery's-in-memory, but leverage Solr's DocSet caching instead of  
BitSets).  Its very fast!   The cons to doing all this yourself is  
when things get bigger you gotta change how it works to scale, and  
Solr already has a lot of infrastructure in place for this eventuality.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message