lucene-java-user mailing list archives

From "Michael D. Curtin"
Subject Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)
Date Wed, 02 Nov 2005 13:54:27 GMT
Richard Jones wrote:
> Hi,
> I'm using lucene (which rocks, btw ;) behind the scenes at for 
> various things, and i've run into a situation that seems somewhat inelegant 
> regarding populating fields which i already know the termvector for.
> I'm creating a document for each user ( tracks music taste for people), 
> with a field that depicts a user's favourite 500 artists. Each artist is 
> represented by an integer, here's a simple example with 3 artists:
> If i've listened to Radiohead (id 1) 10 times, Coldplay (id 2) 5 times and 
> Beck (id 3) 2 times, the field would look like this "1 1 1 1 1 1 1 1 1 1 2 2 
> 2 2 2 3 3"
> I use this index for quickly finding "top fans" of an artist or combination of 
> artists, comparing people's music taste and other things on the fly. 
> The issue is that i already have the termvector (radiohead=10, coldplay=5, 
> beck=2) handy as a hashtable, and i've found myself building up a string of 
> numbers separated by spaces as shown above, then feeding this into lucene (i 
> store the termvec of the field in lucene).  Is there a way i could pass a 
> termvector directly to lucene to cut out the ugly "turn it into a string and 
> let lucene parse it" step? basically i want to provide the termvector for a 
> field when inserting a new document, rather than let lucene build it by 
> analyzing a string.

I can think of a few ways.  If elegance is your goal, then a little 
relational database theory might help.  Specifically, instead of having 
one record per listener, have one record per listener-artist 
combination, with three fields:  listenerid, artistid, and count.  Your 
example above would then look like
listenerid  artistid  count
----------  --------  -----
         X         1  00010
         X         2  00005
         X         3  00002
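
Here is a minimal sketch of that flattening step, assuming the play counts 
are already held in a map keyed by artist id.  The class, the Row type and 
the flatten method are hypothetical names for illustration, not part of 
Lucene.  Counts are zero-padded to five digits, as in the table, so that 
they sort the same way as strings as they do as numbers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ListenerRecords {

    /** One listener-artist record; count is stored as a zero-padded string. */
    public static final class Row {
        public final String listenerId;
        public final String artistId;
        public final String count;

        Row(String listenerId, String artistId, String count) {
            this.listenerId = listenerId;
            this.artistId = artistId;
            this.count = count;
        }
    }

    /** Turns one listener's artist-to-playcount map into one record per artist. */
    public static List<Row> flatten(String listenerId, Map<Integer, Integer> playCounts) {
        List<Row> rows = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : playCounts.entrySet()) {
            // Pad to a fixed width of 5 so string order matches numeric order.
            String padded = String.format("%05d", e.getValue());
            rows.add(new Row(listenerId, String.valueOf(e.getKey()), padded));
        }
        return rows;
    }

    public static void main(String[] args) {
        // The example from the original message: Radiohead=10, Coldplay=5, Beck=2.
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        counts.put(1, 10);
        counts.put(2, 5);
        counts.put(3, 2);
        for (Row r : flatten("X", counts)) {
            System.out.println(r.listenerId + " " + r.artistId + " " + r.count);
        }
    }
}
```

Each Row would then become its own small document in the index, with 
listenerid, artistid and count as separate fields.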

You could compose queries to get all artists somebody ever listened to 
(listenerid:X), all Radiohead listeners (artistid:1), anybody who 
listened to Coldplay 5 or more times (artistid:2 AND count:[00005 TO 
99999]) or what-have-you.  This approach would require two-stage 
processing for queries of the form "find everybody who listened to 
Radiohead three times and Coldplay twice", though, because each 
condition matches a different record.
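
The two-stage processing could look roughly like the sketch below.  It uses 
an in-memory list of {listenerid, artistid, count} rows as a stand-in for 
the index, so no Lucene calls appear; the class and method names are 
hypothetical.  It also assumes "three times" means "at least three times". 
Stage one collects the listener ids matching each artist condition, and 
stage two intersects the two sets.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class TwoStageQuery {

    /** Stage one: ids of listeners whose record for artistId has count >= minCount. */
    public static Set<String> listenersWith(List<String[]> rows, String artistId, int minCount) {
        // Compare zero-padded strings, as a string-based range query would.
        String lower = String.format("%05d", minCount);
        Set<String> ids = new TreeSet<>();
        for (String[] row : rows) {
            if (row[1].equals(artistId) && row[2].compareTo(lower) >= 0) {
                ids.add(row[0]);
            }
        }
        return ids;
    }

    /** Stage two: keep only the listeners that satisfy both conditions. */
    public static Set<String> both(List<String[]> rows,
                                   String artistA, int minA,
                                   String artistB, int minB) {
        Set<String> result = listenersWith(rows, artistA, minA);
        result.retainAll(listenersWith(rows, artistB, minB));
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample rows: {listenerid, artistid, count}.
        List<String[]> rows = Arrays.asList(
            new String[] {"X", "1", "00010"},
            new String[] {"X", "2", "00005"},
            new String[] {"X", "3", "00002"},
            new String[] {"Y", "1", "00004"},
            new String[] {"Y", "2", "00001"},
            new String[] {"Z", "2", "00007"});
        // Radiohead (1) at least 3 times AND Coldplay (2) at least twice.
        System.out.println(both(rows, "1", 3, "2", 2)); // prints [X]
    }
}
```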

Really, though, your problem sounds more like a relational db problem 
than a text search problem.  A simple MySQL database with a few tables 
might be a better fit ...

