lucene-solr-user mailing list archives

From Walter Underwood <>
Subject Re: Skewed IDF in multi lingual index, again
Date Thu, 30 Nov 2017 16:42:15 GMT
Expanding the query to use both the tagged and untagged term might work. I’m not sure the
effect would be much different from boosting the preferred language.
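
A minimal sketch of that expansion, to make the comparison concrete. Everything here is an assumption for illustration: the `[fr]term` notation stands in for a language-tagged token, the document frequencies are made up, and scoring is reduced to a BM25-style idf sum with term frequency and length normalization ignored.

```python
import math

def idf(doc_freq, doc_count):
    """BM25-style idf, an illustrative stand-in for the real similarity."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical statistics for a single field holding 1,000,000 documents.
DOC_COUNT = 1_000_000
DOC_FREQ = {
    "[en]laserjet": 9_000,
    "[fr]laserjet": 500,
    "laserjet": 9_500,  # untagged synonym: present in every language
}

def expand(term, lang):
    """Expand a query term into its language-tagged and untagged forms."""
    return [f"[{lang}]{term}", term]

def score(doc_terms, query_terms):
    """Sum the idf of each query term present in the document (tf ignored)."""
    return sum(idf(DOC_FREQ[t], DOC_COUNT) for t in query_terms if t in doc_terms)

query = expand("laserjet", "fr")              # user prefers French
french_doc = {"[fr]laserjet", "laserjet"}     # matches both query terms
english_doc = {"[en]laserjet", "laserjet"}    # matches only the untagged term

print(score(french_doc, query), score(english_doc, query))
```

Under these assumptions the preferred-language document simply earns the extra tagged-term contribution on top of the shared untagged one, which behaves much like a boost on the preferred language.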

Walter Underwood  (my blog)

> On Nov 30, 2017, at 8:35 AM, Markus Jelsma <> wrote:
> This is unfortunately not what we want. Some customers use filters to restrict language,
> but some customers don't. They want to be able to find documents in all languages, so we use
> the user's language preference to get their local language on top. Very relevant documents
> in foreign languages should still show up, which is why the deboost is not set too low.
> Thanks,
> Markus
> -----Original message-----
>> From:Walter Underwood <>
>> Sent: Thursday 30th November 2017 17:29
>> To:
>> Subject: Re: Skewed IDF in multi lingual index, again
>> I’ve occasionally considered using Unicode language tags (U+E0001 and friends) on
>> each term. That would make a term specific to a language, so we would get [en]LaserJet,
>> [fr]LaserJet, [de]LaserJet, and so on. But that is a pretty big hammer, because it
>> restricts matches to the same language. If the entire document is in one language, you
>> might as well use a filter query for that language. The tags would work for multiple
>> languages in one document.
>> Maybe make the untagged term a synonym. For cross-language terms like “LaserJet”,
>> the untagged one would have worse idf.
>> wunder
>> Walter Underwood
>>  (my blog)
>>> On Nov 30, 2017, at 8:14 AM, Markus Jelsma <> wrote:
>>> Hello,
>>> We already discussed this problem five years ago [1]. In short: documents in
>>> foreign languages are scored higher for some terms.
>>> It was solved back then by using docCount instead of maxDoc when calculating
>>> idf, and it worked really well! But, probably due to index changes, the problem is
>>> back for some terms, mostly proper nouns, just like five years ago.
>>> We already deboost documents that are not in the user's preferred language by 0.7,
>>> but in some cases it is not enough. I could reduce that boost further, but that's
>>> not what I prefer.
>>> I'd like to know if there are additional tricks to solve the problem.
>>> Many thanks!
>>> Markus
>>> [1]
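
For readers following the idf discussion in the quoted message, here is a rough sketch of why the choice between docCount and maxDoc matters and why a 0.7 deboost can fall short. It assumes one content field per language, a BM25-style idf, a multiplicative deboost, and invented counts; none of these details come from the thread itself.

```python
import math

def idf(doc_freq, n_docs):
    """BM25-style idf; n_docs is whichever collection size the similarity uses."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical index with one content field per language.
MAX_DOC = 1_000_000                         # every document in the index
DOC_COUNT = {"en": 900_000, "nl": 50_000}   # documents with a value in each field
DOC_FREQ = {"en": 9_000, "nl": 500}         # documents containing "laserjet"

DEBOOST = 0.7  # multiplier for documents outside the preferred language

def term_scores(use_doc_count):
    """Return the idf-only score of the term for each language field."""
    return {
        lang: idf(DOC_FREQ[lang], DOC_COUNT[lang] if use_doc_count else MAX_DOC)
        for lang in DOC_FREQ
    }

for label, flag in (("maxDoc", False), ("docCount", True)):
    s = term_scores(flag)
    # The user prefers English, so the Dutch document is deboosted.
    print(label, "en:", round(s["en"], 2), "nl deboosted:", round(DEBOOST * s["nl"], 2))
```

With these made-up numbers, the maxDoc variant gives the term in the small Dutch field such a high idf that the deboosted Dutch document still outranks the English one, while the docCount variant puts both fields on a comparable scale. Real indexes will differ, especially for proper nouns whose frequency is uneven across languages.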
