lucene-java-user mailing list archives

From Karl Wettin <>
Subject "Field weights"
Date Fri, 14 Dec 2007 17:06:01 GMT
I have an index that contains three sorts of documents:

Car brand
Tire brand
Tire pressure

(Please bear with me; the real index has nothing to do with cars. I am
just trying to explain the problem in an alternative domain to avoid NDA
issues.)

There is a hierarchical composite relationship between these sorts of
documents. A document describing "tire pressure" also contains "tire  
brand" and "car brand". A document describing "tire brand" also  
contains information about "car brand". A document describing "car  
brand" contains only that.

The requirement is that the consumer of the API should not have to  
specify what fields they are searching in. There is no time (nor  
training data) to implement a hidden markov model (HMM) tokenizer or  
something along that path in order to extract possible attributes from  
the query string. Instead the query string is tokenized once per field  
and assembled into one huge query. Normally this works fairly well.
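
As a rough illustration of the per-field approach, here is a minimal
sketch in plain JDK Java. It is not Lucene code: the field names (car,
tire, pressure) and the lowercasing tokenizer are assumptions standing in
for whatever analyzer each field really uses, and the result is only a
query string, not a Lucene query object.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Sketch only: plain JDK code standing in for per-field analyzers.
class PerFieldQuerySketch {

    // Hypothetical field -> tokenizer table; a real index would have
    // a different analysis chain (stemming etc.) for each field.
    static final Map<String, Function<String, List<String>>> TOKENIZERS =
            new LinkedHashMap<>();
    static {
        TOKENIZERS.put("car", PerFieldQuerySketch::lowercaseWords);
        TOKENIZERS.put("tire", PerFieldQuerySketch::lowercaseWords);
        TOKENIZERS.put("pressure", PerFieldQuerySketch::lowercaseWords);
    }

    // Assumed stand-in tokenizer: split on whitespace and lowercase.
    static List<String> lowercaseWords(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Tokenize the input once per field and join every field:token
    // clause into one large query string.
    static String buildQuery(String userInput) {
        List<String> clauses = new ArrayList<>();
        for (Map.Entry<String, Function<String, List<String>>> e
                : TOKENIZERS.entrySet()) {
            for (String token : e.getValue().apply(userInput)) {
                clauses.add(e.getKey() + ":" + token);
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        // prints: car:saab tire:saab pressure:saab
        System.out.println(buildQuery("Saab"));
    }
}
```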

Here are some example documents:

Volvo, Michelin
Volvo, Nokian
Volvo, Nokian, 2.2 bars
Volvo, Firestone, 2.4 bars

Saab, Michelin
Saab, Nokian
Saab, Nokian, 2.1 bars
Saab, Firestone
Saab, Firestone, 2.4 bars
Saab, Firestone, 2.5 bars

If I search for Saab, the top result will be the document representing
the car brand "Saab". The query would look like this: "car:saab
tire:saab pressure:saab"

But let's say Saab starts manufacturing tires too:

Saab, Saab tires
Saab, Saab tires, 1.9 bars
Saab, Saab tires, 1.8 bars

If I search for "Saab" I still want the top result to be Saab the car
brand. But it no longer is: the match for "Saab, Saab tires" now
has a greater score than "Saab", of course.
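
To make the effect concrete, here is a toy sketch that scores a document
by the number of field clauses it satisfies. This is a deliberate
simplification and not Lucene's scoring, which involves more factors; the
class, the two-field document layout and the counting rule are all
assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model only: counts matching clauses, nothing more.
class ClauseCountSketch {

    // Build a toy document: field name -> set of lowercased tokens.
    static Map<String, Set<String>> doc(String car, String tire) {
        Map<String, Set<String>> d = new HashMap<>();
        d.put("car", tokens(car));
        d.put("tire", tokens(tire));
        return d;
    }

    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    // Toy score: how many fields of the document contain the term,
    // i.e. how many field:term clauses of the big query match.
    static int score(Map<String, Set<String>> d, String term) {
        int matches = 0;
        for (Set<String> fieldTokens : d.values()) {
            if (fieldTokens.contains(term)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // "Saab" the car brand matches only the car clause: prints 1
        System.out.println(score(doc("Saab", ""), "saab"));
        // "Saab, Saab tires" matches car and tire clauses: prints 2
        System.out.println(score(doc("Saab", "Saab tires"), "saab"));
    }
}
```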

My idea is to work along the lines of indexing "Saab" in the tire brand
and tire pressure fields too. Now searching for Saab will yield a
result where the car brand "Saab" is the top result.

However, this will not work as I have different tokenization  
strategies for each field (stemming and what not). Tokenizing the  
query string Saab for the field "tire brand" in Swedish might end up  
as "saa" and will thus not find the token Saab inserted for the  
document describing the car brand Saab.

I have a couple of experiments in my head I need to try out, starting
with tokenizing query strings per field and using the tokens generated
for the field car brand as the query in the tire brand and tire pressure
fields too. And vice versa.
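
A minimal sketch of that experiment, in plain JDK Java rather than
Lucene: every field's analysis of the query word is used as a clause in
every field. The field names and both analyzers are hypothetical; in
particular the "stemmer" below just drops a trailing "b" so that it
reproduces the "Saab" to "saa" example and nothing else.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Sketch only: cross-applies per-field analysis to all fields.
class CrossFieldQuerySketch {

    // Hypothetical single-word analyzers, one per field.
    static final Map<String, Function<String, String>> ANALYZERS =
            new LinkedHashMap<>();
    static {
        ANALYZERS.put("car", s -> s.toLowerCase(Locale.ROOT));
        // Stand-in for a stemmer that turns "Saab" into "saa".
        ANALYZERS.put("tire", s -> {
            String t = s.toLowerCase(Locale.ROOT);
            return t.endsWith("b") ? t.substring(0, t.length() - 1) : t;
        });
        ANALYZERS.put("pressure", s -> s.toLowerCase(Locale.ROOT));
    }

    // For each target field, add one clause per distinct token that
    // any field's analyzer produces for the query word.
    static Set<String> buildQuery(String word) {
        Set<String> clauses = new LinkedHashSet<>();
        for (String targetField : ANALYZERS.keySet()) {
            for (Function<String, String> analyzer : ANALYZERS.values()) {
                clauses.add(targetField + ":" + analyzer.apply(word));
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // prints: car:saab car:saa tire:saab tire:saa pressure:saab pressure:saa
        System.out.println(String.join(" ", buildQuery("Saab")));
    }
}
```

With clauses like tire:saab present, a token "saab" copied unstemmed into
the tire brand field would be found even though the tire field's own
analysis yields "saa".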

Any brilliant ideas that might work? Hacky solutions are OK.

