lucene-solr-user mailing list archives

From TK Solr <tksol...@sonic.net>
Subject Re: Solr Multilingual Indexing with one field- Guidance
Date Wed, 13 May 2015 02:56:02 GMT

On 5/7/15, 11:23 AM, Kuntal Ganguly wrote:
> 1) Is this a correct approach to do it? Or i'm missing something?
Does the user want to see documents that he/she doesn't understand?
Words such as "doctor" and "taxi" are common to many European languages.
Would a Spanish user want to see English documents?
Of course, this issue can be worked around by having a separate language field.

How do you handle word collisions among languages?
"Kind" in German means "child" in English. If a German user searches for articles
about children, they will find lots of unrelated English
articles about someone being kind.
This too can be worked around by having a language field.
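
As a rough illustration of the language-field workaround, here is a toy in-memory search in Python. It is not Solr code; the documents and the field name "language" are made up for the example. In Solr the equivalent restriction would typically be a filter query along the lines of fq=language:de, assuming the schema has such a field.

```python
# Toy corpus. The "language" field and the documents are hypothetical.
docs = [
    {"id": 1, "language": "de", "text": "das kind spielt im garten"},
    {"id": 2, "language": "en", "text": "she was kind to the taxi driver"},
    {"id": 3, "language": "en", "text": "be kind to strangers"},
    {"id": 4, "language": "de", "text": "ein kind braucht schlaf"},
]


def search(term, language=None):
    """Return ids of documents containing `term` as a whole word,
    optionally restricted to a single language."""
    hits = []
    for doc in docs:
        matches_term = term in doc["text"].split()
        matches_language = language is None or doc["language"] == language
        if matches_term and matches_language:
            hits.append(doc["id"])
    return hits


# Without the language restriction, German and English documents are mixed.
all_hits = search("kind")
# With it, only the German documents about a child remain.
german_hits = search("kind", language="de")
```

Here search("kind") matches all four documents, while search("kind", language="de") matches only the two German ones.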

By default, Solr/Lucene hits are sorted by relevancy score, and
the score calculation uses IDF. If a search term appears in many documents,
its score is low. Because virtually all German documents contain "die", the definite article,
the score of the English word "die" will be low as well.
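
To put rough numbers on this, the sketch below computes IDF for "die" in an English-only index and in a mixed English/German index. It assumes the classic Lucene TF-IDF form idf = 1 + ln(numDocs / (docFreq + 1)); whether that is the formula in effect depends on the similarity configured for the index. The document counts are invented for illustration.

```python
import math


def classic_idf(doc_freq, num_docs):
    """IDF in the classic Lucene TF-IDF form (assumed here):
    1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical counts: "die" is rare in English text, near-universal in German.
english_docs, english_with_die = 1000, 10
german_docs, german_with_die = 1000, 990

# English documents indexed on their own.
idf_english_only = classic_idf(english_with_die, english_docs)

# English and German documents sharing one field: the document frequencies add up.
idf_mixed = classic_idf(
    english_with_die + german_with_die,
    english_docs + german_docs,
)

print(f"English-only index: {idf_english_only:.2f}")
print(f"Mixed index:        {idf_mixed:.2f}")
```

With these made-up counts the IDF drops from about 5.5 to about 1.7, so an English query for "die" loses most of its weight once German documents share the same field.
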
>
> 2) Can you give me an example where there will be problem with this above
> new field type? A use-case/scenario with example will be very helpful.

If you have lots of Japanese documents indexed, try searching for "京都" (Kyoto).
You will find many documents about Tokyo (東京) because the government
of the metropolitan Tokyo area is written as "東京都" = Tokyo Capital, which
generates two bigrams, 東京 and 京都.
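
The overlap can be reproduced with a few lines of Python. This is a simplified stand-in for character-bigram tokenization, not Solr's actual analyzer chain, and the function name is hypothetical.

```python
def bigrams(text):
    """Split text into overlapping two-character tokens."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# "東京都" (Tokyo Metropolis) yields both "東京" (Tokyo) and "京都" (Kyoto).
tokens = bigrams("東京都")

# A query for Kyoto therefore matches a document that only mentions Tokyo.
kyoto_matches_tokyo_doc = "京都" in tokens
```

Here tokens is ['東京', '京都'], so kyoto_matches_tokyo_doc is True.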

Kuro



