lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Elschot <>
Subject Re: "Field weights"
Date Fri, 14 Dec 2007 18:03:55 GMT

This might work for you:

Paul Elschot

On Friday 14 December 2007 18:06:01 Karl Wettin wrote:
> I have an index that contains three sorts of documents:
> Car brand
> Tire brand
> Tire pressure
> (Please bear with me, the real index has nothing to do with cars. I  
> just try to explain the problem in an alternative domain to avoid NDA  
> conflicts.)
> There is a heirarchial composite relationship between these sort of  
> documents. A document describing "tire pressure" also contains "tire  
> brand" and "car brand". A document describing "tire brand" also  
> contains information about "car brand". A document describing "car  
> brand" contains only that.
> The requirement is that the consumer of the API should not have to  
> specify what fields they are searching in. There is no time (nor  
> training data) to implement a hidden markov model (HMM) tokenizer or  
> something along that path in order to extract possible attributes from  
> the query string. Instead the query string is tokenized once per field  
> and assebled to one huge query. Normally this works fairly well.
> Here are some example documents:
> Volvo
> Volvo, Michelin
> Volvo, Nokian
> Volvo, Nokian, 2.2 bars
> Volvo, Firestone, 2.4 bars
> Saab
> Saab, Michelin
> Saab, Nokian
> Saab, Nokian, 2.1 bars
> Saab, Firestone
> Saab, Firestone, 2.4 bars
> Saab, Firestone, 2.5 bars
> If I search for Saab the top result will be the document  representing  
> the car brand "Saab".  The query would look like this: "car:saab  
> tire:saab preasure:saab"
> But lets say Saab starts manufacturing tires too:
> Saab
> Saab, Saab tires
> Saab, Saab tires, 1.9 bars
> Saab, Saab tires, 1.8 bars
> If I search for "Saab" I still want the top result to be Saab the car  
> brand. But  it not longer is, the match for "Saab, Saab tires" now  
> have a greater score than "Saab", of course.
> My idea is to work along the line of indexing "Saab" in the tire brand  
> and tire pressure field too. Now searching for Saab will yeild a  
> result where the car brand "Saab" is the top result.
> However, this will not work as I have different tokenization  
> strategies for each field (stemming and what not). Tokenizing the  
> query string Saab for the field "tire brand" in Swedish might end up  
> as "saa" and will thus not find the token Saab inserted for the  
> document describing the car brand Saab.
> I have a couple of experiments in my head I need to try out, starting  
> with tokezining query strings per field and using the tokens generated  
> for the field car brand as query in the tire brand and tire pressure  
> too. And vice versus.
> Any brilliant ideas that might work? Hacky solutions are OK.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message