lucene-solr-user mailing list archives

From Uri Boness <ubon...@gmail.com>
Subject Re: multi-language search
Date Mon, 24 Aug 2009 22:01:05 GMT
I can think of two ways to tackle your problem:

Option 1: each document will have a field indicating its language. Then, 
when searching, you can simply filter the query on the language you're 
searching on. Advantages: everything is in one index, so if in the 
future you will need to do a cross language search you'll be able to do 
that without changing anything. Disadvantages: depending on how 
your data is structured, the index can grow large. If you only ever 
search one language at a time, every query still runs against the whole 
index while using only part of it, which costs some performance (how 
much depends on the size 
of the index). Another disadvantage is that the schema configuration can 
get a bit messy - since everything is in one index, for each field and 
field type you'll probably need to define different versions for 
different languages (each one with a different language specific 
analyzer), so for example, if you have a "title" field, you'll probably 
need to define "title_en" (for English content) and "title_zh" (for 
Chinese content), then you will also need to make sure that when you 
index the content, you send the right fields to Solr (although, you can 
perhaps create a clever update processor that updates the field names 
based on the language field).
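
To make Option 1 a bit more concrete, here is a rough client-side sketch in Python of the two pieces involved: renaming fields according to the document's language before sending it to Solr, and restricting a search to one language with a filter query. The field name "language", the "body" field and the LOCALIZED_FIELDS set are assumptions for illustration only; the "title_en" / "title_zh" naming follows the example above. It does not talk to Solr, it only builds the document and the request parameters.

```python
# Hypothetical set of fields that have one variant per language in the schema.
LOCALIZED_FIELDS = {"title", "body"}


def localize_fields(doc, language_field="language"):
    """Return a copy of doc with localized fields renamed, e.g. title -> title_en.

    The language code is read from doc[language_field]; all other fields
    (including the language field itself) are copied unchanged.
    """
    lang = doc[language_field]
    out = {}
    for name, value in doc.items():
        if name in LOCALIZED_FIELDS:
            out[f"{name}_{lang}"] = value
        else:
            out[name] = value
    return out


def search_params(text, lang, language_field="language"):
    """Build Solr request parameters for a title search limited to one language.

    "q" targets the language-specific title field and "fq" filters on the
    language field. The text is not escaped for Solr query syntax here.
    """
    return {
        "q": f"title_{lang}:{text}",
        "fq": f"{language_field}:{lang}",
    }
```

For example, localize_fields({"id": "1", "language": "en", "title": "Hello"}) yields a document with "title_en" instead of "title", and search_params("solr", "zh") yields q=title_zh:solr with fq=language:zh. A cross language search would simply drop the fq parameter and query several title_* fields.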

Option 2: have a separate Solr core for each language. Advantages: Well, 
as opposed to Option 1, here you have smaller indexes, where each is 
dedicated to one language. If the corpus is very big you can have 
performance gains here. Since we are talking about different indexes 
here, each core has its own simple and clean schema (no need for 
multiple fields and field types). Disadvantage: The main one is that you 
cannot perform cross language search. You also need to remember to use 
the right Solr core when indexing & querying.
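
The "remember to use the right core" part of Option 2 can be kept in one place with a small lookup, sketched below in Python. The core names ("docs_en", "docs_zh") and the base URL in the example are made up; the only thing assumed about Solr is the usual multi-core URL layout, where the core name comes before the handler path.

```python
# Hypothetical mapping from language code to Solr core name.
CORES = {"en": "docs_en", "zh": "docs_zh"}


def core_url(base_url, lang, handler):
    """Return the URL of the given handler on the core for this language.

    Raises ValueError if no core is configured for the language, so a
    document is never sent silently to the wrong index.
    """
    try:
        core = CORES[lang]
    except KeyError:
        raise ValueError(f"no core configured for language {lang!r}")
    return f"{base_url.rstrip('/')}/{core}/{handler}"
```

Both indexing and querying would then go through the same function, e.g. core_url("http://localhost:8983/solr", "zh", "update") when posting Chinese documents and core_url("http://localhost:8983/solr", "zh", "select") when searching them.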

> 2) I posted some chinese docs to the server. The query of my chinese
> word does not return any result. This happens to my arabic docs too.
> What filter should I look at for this type of problem. Thanks a lot!
>   
Sorry, I don't have experience with Arabic or Chinese languages so I 
don't know of any good analyzers for them.

Cheers,
Uri
> Hi,
>
> I have two questions.
>
> 1) Can solr be configured so all my english docs will be saved in a
> group, say group-en? My chinese docs will be saved in group-cn. So my
> search will only be conducted in the intended group, instead of
> everywhere.
>
> 2) I posted some chinese docs to the server. The query of my chinese
> word does not return any result. This happens to my arabic docs too.
> What filter should I look at for this type of problem. Thanks a lot!
>
> Elaine
>
>   
