lucene-solr-user mailing list archives

From Markus Jelsma <>
Subject RE: Skewed IDF in multi lingual index
Date Fri, 09 Nov 2012 09:22:41 GMT
Robert, Tom,

That's it indeed! Using maxDoc as the numerator, as opposed to docCount, yields very skewed
results for an unevenly distributed multi-lingual index. We have one language dominating the
other twenty, so the dominating language contains no rare terms compared to the others.

We're now checking results using docCount and it seems alright. I do have to get used to the
fact that document scores are now roughly 1000 times higher than before, but I'm already very
happy with CollectionStatistics and will see if all works well.
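
A minimal sketch of the effect, using the 95/5 split from Robert's example quoted below. It is plain Java with no Lucene dependency and assumes the classic formula idf = 1 + ln(numDocs / (docFreq + 1)). The class name, the helper effectiveNumDocs, and the document frequency of 3 are hypothetical, chosen only for illustration.

```java
// Illustration only: plain Java, not the Lucene Similarity API.
final class IdfSkewDemo {

    // Classic idf, assumed here: 1 + ln(numDocs / (docFreq + 1)).
    public static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Hypothetical helper: use the per-field docCount when it is available,
    // and fall back to maxDoc when docCount is reported as -1
    // (the case Robert mentions for indexes that still hold 3.x segments).
    public static long effectiveNumDocs(long docCount, long maxDoc) {
        return docCount == -1 ? maxDoc : docCount;
    }

    public static void main(String[] args) {
        long maxDoc = 100;     // all documents in the index
        long frDocCount = 5;   // documents that have the text_fr field
        long docFreq = 3;      // hypothetical: French docs containing the term

        double skewed = idf(docFreq, maxDoc);
        double perField = idf(docFreq, effectiveNumDocs(frDocCount, maxDoc));

        System.out.printf("idf with maxDoc:   %.4f%n", skewed);
        System.out.printf("idf with docCount: %.4f%n", perField);
    }
}
```

With these numbers the same French term gets an idf of roughly 4.22 against maxDoc and roughly 1.22 against docCount, so a term found in most French documents stops looking rare.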

Any other tips to share?


-----Original message-----
> From:Robert Muir <>
> Sent: Thu 08-Nov-2012 17:44
> To:
> Subject: Re: Skewed IDF in multi lingual index
> Hi Markus: how are the languages distributed across documents?
> Imagine I have a text_en field and a text_fr field. Let's say I have
> 100 documents, 95 are English and only 5 are French.
> So the text_en field is populated 95% of the time, and the text_fr 5%
> of the time.
> But the default IDF computation doesn't look at things this way: it
> always uses '100' as maxDoc. So in such a situation, any terms against
> text_fr are "rare" :)
> The first thing I would look at is treating this situation as merging
> results from an English index with 95 docs and a French index with 5
> docs.
> So I would consider overriding the two idfExplain methods (term and
> phrase) to use CollectionStatistics.docCount() instead of
> CollectionStatistics.maxDoc()
> The former would be 95 for the English field (instead of 100), and 5
> for the French field (instead of 100).
> I don't think this will solve all your problems: but it might help.
> Note: you must ensure your index is fully upgraded to 4.0 to try this
> statistic, otherwise it will return -1 if you have any 3.x segments in
> your index.
> On Thu, Nov 8, 2012 at 11:13 AM, Markus Jelsma
> <> wrote:
> > Hi,
> >
> > We're testing a large multi-lingual index with _LANG fields for each language and
> > using dismax to query them all. Users provide language preferences, explicitly or
> > implicitly, that we use for either additive or multiplicative boosting on the language
> > of the document. However, additive boosting is not adequate because it cannot overcome
> > the extremely high IDF values for the same word in another language, so regardless of
> > the preference, foreign documents are returned. Multiplicative boosting solves this
> > problem but has another downside: with standard qf=field^boost it doesn't allow us to
> > prefer documents in another language above the preferred language, because the
> > multiplicative boost is so strong. We do use the def function
> > (boost=def(query($qq),.3)) to prevent one boost query from returning 0 and thus
> > producing a product of 0 for all boost queries, but it doesn't help that much.
> >
> > This all comes down to IDF differences between the languages; even common words
> > such as country names like `india` show large differences in IDF. Is there anyone with
> > some hints or experiences to share about skewed IDF in such an index?
> >
> > Thanks,
> > Markus
