lucene-solr-user mailing list archives

From Markus Jelsma
Subject Skewed IDF in multi lingual index
Date Thu, 08 Nov 2012 16:13:41 GMT

We're testing a large multilingual index with _LANG fields for each language, using dismax
to query them all. Users provide language preferences, explicitly or implicitly, which we use for
either additive or multiplicative boosting on the language of the document. However, additive
boosting is not adequate: it cannot overcome the extremely high IDF values for the
same word in another language, so regardless of the preference, foreign documents are returned.
Multiplicative boosting solves this problem but has a downside of its own: it doesn't allow
us to use the standard qf=field^boost to prefer documents in another language above the preferred
language, because the multiplicative boost is so strong. We do use the def function (boost=def(query($qq),.3))
to prevent one boost query from returning 0 and thus producing a product of 0 across all boost queries, but
it doesn't help that much.
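
To make the difference between the two boosting styles concrete, here is a minimal Python sketch with made-up numbers. The scores, the document ids and the boost values are all assumptions chosen for illustration; none of them come from our index, and the functions only imitate the arithmetic, not Solr's actual scoring.

```python
# Hypothetical base relevance scores (not taken from a real index).
# The foreign-language document scores higher only because the term is
# rare, and therefore has a high IDF, in that language's field.
base_scores = {"doc_preferred_lang": 1.2, "doc_foreign_lang": 6.5}
preferred = {"doc_preferred_lang"}


def additive(score, is_preferred, bonus=2.0):
    """Add a fixed bonus to documents in the preferred language."""
    return score + (bonus if is_preferred else 0.0)


def multiplicative(score, is_preferred, match=2.0, fallback=0.3):
    """Multiply by the boost-query score, or by a fallback when it does not
    match, in the spirit of boost=def(query($qq),.3)."""
    return score * (match if is_preferred else fallback)


def rank(boost_fn):
    """Return document ids ordered by boosted score, best first."""
    boosted = {doc: boost_fn(s, doc in preferred) for doc, s in base_scores.items()}
    return sorted(boosted, key=boosted.get, reverse=True)


additive_order = rank(additive)
multiplicative_order = rank(multiplicative)
print("additive:", additive_order)
print("multiplicative:", multiplicative_order)
```

With these assumed values the additive bonus is too small to close the gap created by the IDF difference, so the foreign document stays on top, while the multiplicative boost flips the order.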

This all comes down to IDF differences between the languages; even common words such as country
names like `india` show large differences in IDF. Does anyone here have hints or experiences
to share about skewed IDF in such an index?
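
As a rough illustration of how the skew arises, the sketch below assumes Lucene's classic TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)), where numDocs is the size of the whole index and docFreq is counted per field. The field names and document counts are hypothetical.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical figures: one shared index, with the same term appearing in
# many documents of a large language and in few documents of a small one.
num_docs = 10_000_000
doc_freq = {"content_en": 200_000, "content_nl": 500}

idf = {field: classic_idf(df, num_docs) for field, df in doc_freq.items()}
for field, value in idf.items():
    print(f"india in {field}: idf = {value:.2f}")
```

Because the document count covers the whole index while the term frequency is counted per language field, the same word ends up with a much higher IDF in the field of the smaller language.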

