lucene-java-user mailing list archives

From Karl Wettin <>
Subject Re: "Field weights"
Date Mon, 17 Dec 2007 01:31:34 GMT
Thanks for the replies.

I failed miserably to explain my problem. I focused too much on the
content similarity between the fields and completely left out that
the proportions of car brands and tire brands in my example do not
match the data in my domain: I have tens of thousands more "tire
brands" than "car brands".

My quick and dirty solution (which fixes the problem with the earlier
quick and dirty solution, the one that avoids implementing the query
tokenizer mentioned in my original post) is based on what I wrote in
the previous paragraph: there are so few "car brands" that I can handle
queries for them using a Map<String, CarBrand> and then merge the hit
(if there was one) with the results from the Lucene index, which now
focuses on "tire brand" and "tire pressure".
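
A minimal sketch of that merge step, not the actual implementation.
Everything here is an assumption made for illustration: the class name
BrandFirstSearch, the shape of CarBrand (only a name), the lowercase
key normalization, and the Lucene hits being modeled as a plain list
of strings that some other code has already retrieved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: exact car-brand lookup merged ahead of index hits. */
final class BrandFirstSearch {

    /** Hypothetical stand-in for the CarBrand type; only a name is assumed. */
    static final class CarBrand {
        final String name;

        CarBrand(String name) {
            this.name = name;
        }
    }

    // Few enough car brands to keep them all in memory, keyed by normalized name.
    private final Map<String, CarBrand> carBrands = new HashMap<>();

    void addCarBrand(String name) {
        carBrands.put(normalize(name), new CarBrand(name));
    }

    // Assumed normalization: trim and lowercase. A real analyzer may differ.
    static String normalize(String text) {
        return text.trim().toLowerCase(Locale.ROOT);
    }

    /**
     * Returns the car brand (if the whole query matches one) followed by
     * the hits from the index, which is assumed to cover only tire brand
     * and tire pressure documents.
     */
    List<String> search(String query, List<String> indexHits) {
        List<String> merged = new ArrayList<>();
        CarBrand brand = carBrands.get(normalize(query));
        if (brand != null) {
            merged.add(brand.name); // the map hit always ranks first
        }
        merged.addAll(indexHits);
        return merged;
    }
}
```

With "Saab" registered as a car brand, a query for "saab" puts "Saab"
first and leaves the index hits in their original order behind it. A
query that matches no car brand returns the index hits unchanged.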

Quite ugly, but good enough for a couple more weeks. I'll report
back if I change my mind.

On 14 Dec 2007, at 19:24, Doron Cohen wrote:

> It seems that documents with fewer fields satisfying
> the query are worth more than those with more fields
> satisfying the query, because the former are more "to
> the point".
> At least it seems so in the example.
> If this makes sense I would try to compose a top level
> boolean query out of the one-field queries, and make it use
> a similarity that implements this strange logic of coord.
> (Strange, because usually coord punishes for under
> matching, while here you want to reward for just that.)
> A tricky part would be to make the sub-queries use
> the "regular" coord logic, but first let's see if this is
> a valid direction at all.
> Doron
> On Dec 14, 2007 8:03 PM, Paul Elschot <> wrote:
>> Karl,
>> This might work for you:
>> Regards,
>> Paul Elschot
>> On Friday 14 December 2007 18:06:01 Karl Wettin wrote:
>>> I have an index that contains three sorts of documents:
>>> Car brand
>>> Tire brand
>>> Tire pressure
>>> (Please bear with me, the real index has nothing to do with cars. I
>>> am just trying to explain the problem in an alternative domain to
>>> avoid NDA conflicts.)
>>> There is a hierarchical composite relationship between these sorts of
>>> documents. A document describing "tire pressure" also contains "tire
>>> brand" and "car brand". A document describing "tire brand" also
>>> contains information about "car brand". A document describing "car
>>> brand" contains only that.
>>> The requirement is that the consumer of the API should not have to
>>> specify what fields they are searching in. There is no time (nor
>>> training data) to implement a hidden Markov model (HMM) tokenizer or
>>> something along that path in order to extract possible attributes
>>> from the query string. Instead the query string is tokenized once
>>> per field and assembled into one huge query. Normally this works
>>> fairly well.
>>> Here are some example documents:
>>> Volvo
>>> Volvo, Michelin
>>> Volvo, Nokian
>>> Volvo, Nokian, 2.2 bars
>>> Volvo, Firestone, 2.4 bars
>>> Saab
>>> Saab, Michelin
>>> Saab, Nokian
>>> Saab, Nokian, 2.1 bars
>>> Saab, Firestone
>>> Saab, Firestone, 2.4 bars
>>> Saab, Firestone, 2.5 bars
>>> If I search for Saab the top result will be the document
>>> representing the car brand "Saab". The query would look like this:
>>> "car:saab tire:saab preasure:saab"
>>> But let's say Saab starts manufacturing tires too:
>>> Saab
>>> Saab, Saab tires
>>> Saab, Saab tires, 1.9 bars
>>> Saab, Saab tires, 1.8 bars
>>> If I search for "Saab" I still want the top result to be Saab the
>>> car brand. But it no longer is: the match for "Saab, Saab tires" now
>>> has a greater score than "Saab", of course.
>>> My idea is to work along the lines of indexing "Saab" in the tire
>>> brand and tire pressure fields too. Now searching for Saab will yield
>>> a result where the car brand "Saab" is the top result.
>>> However, this will not work as I have different tokenization
>>> strategies for each field (stemming and what not). Tokenizing the
>>> query string Saab for the field "tire brand" in Swedish might end up
>>> as "saa" and will thus not find the token Saab inserted for the
>>> document describing the car brand Saab.
>>> I have a couple of experiments in my head I need to try out,
>>> starting with tokenizing query strings per field and using the tokens
>>> generated for the field car brand as the query in the tire brand and
>>> tire pressure fields too. And vice versa.
>>> Any brilliant ideas that might work? Hacky solutions are OK.
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail: