lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <>
Subject Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)
Date Wed, 02 Nov 2005 16:04:05 GMT
Not sure if this is feasible, but is there someway you could use a 
"fake" analyzer that you constructed using your hashtable/termvector and 
then have it output the tokens directly from the hashtable via the 
TokenStream?  Maybe you would have to pass in an empty/dummy string to 
the field constructor.   At least you wouldn't have to create the string 
of your term vectors

Just a thought,

Richard Jones wrote:

>I'm using lucene (which rocks, btw ;) behind the scenes at for 
>various things, and i've run into a situation that seems somewhat inelegant 
>regarding populating fields which i already know the termvector for.
>I'm creating a document for each user ( tracks music taste for people), 
>with a field that depicts a users favourite 500 artists. Each artist is 
>represented by an integer, here's a simple example with 3 artists:
>If i've listened to Radiohead (id 1) 10 times, Coldplay (id 2) 5 times and 
>Beck (id 3) 2 times, the field would look like this "1 1 1 1 1 1 1 1 1 1 2 2 
>2 2 2 3 3"
>I use this index for quickly finding "top fans" of an artist or combination of 
>artists, comparing peoples music taste and other things on the fly. 
>The issue is that i already have the termvecor (radiohead=10, coldplay=5, 
>beck=2) handy as a hashtable, and i've found myself building up a string of 
>numbers separated by spaces as shown above, then feeding this into lucene (i 
>store the termvec of the field in lucene).  Is there a way i could pass a 
>termvector directly to lucene to cut out the ugly "turn it into a string and 
>let lucene parse it" step? basically i want to provide the termvector for a 
>field when inserting a new document, rather than let lucene build it by 
>analyzing a string.
>This does feel like a rather perverted use of lucene i suppose.. It's faster 
>and less hassle than other methods i've tried to date though.
>To unsubscribe, e-mail:
>For additional commands, e-mail:

Grant Ingersoll 
Sr. Software Engineer 
Center for Natural Language Processing 
Syracuse University 
School of Information Studies 
337 Hinds Hall 
Syracuse, NY 13244 
Voice:  315-443-5484 
Fax: 315-443-6886 

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message