lucene-solr-user mailing list archives

From Elaine Li <elaine.bing...@gmail.com>
Subject Re: multi-language search
Date Tue, 25 Aug 2009 14:34:18 GMT
Uri,

Thanks a lot! I don't need to do cross-language search, so Option 2
sounds better, because my corpus is very large.

I am still looking for help with Chinese language search. I tried
ChineseTokenizerFactory in my analyzer, but it did not help. Only words
with whitespace, commas, etc. around them can be found.
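
A minimal sketch of the symptom, in plain Python rather than Solr: if text is split on whitespace only, an unbroken run of Chinese characters becomes one token, so a query for a word inside that run never matches. Emitting overlapping two-character tokens (bigrams) is one common workaround for CJK text. The function names below are hypothetical and this is not the code of any Solr tokenizer.

```python
def whitespace_tokens(text):
    # Split on whitespace only; a run of Chinese characters stays one token.
    return text.split()


def cjk_bigrams(text):
    # Emit overlapping two-character tokens from each whitespace-delimited run.
    tokens = []
    for run in text.split():
        if len(run) == 1:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


doc = "我喜欢搜索引擎"
query = "搜索"

# The whole sentence is a single token, so the query word is not found.
print(query in whitespace_tokens(doc))
# With bigrams, the query word is one of the indexed tokens.
print(query in cjk_bigrams(doc))
```

The same tokenization has to be applied at both index time and query time for the bigram tokens to line up.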

Elaine

On Mon, Aug 24, 2009 at 6:01 PM, Uri Boness<uboness@gmail.com> wrote:
> I can think of two ways to tackle your problem:
>
> Option 1: each document will have a field indicating its language. Then,
> when searching, you can simply filter the query on the language you're
> searching on. Advantages: everything is in one index, so if in the future
> you will need to do a cross language search you'll be able to do that
> without changing anything. Disadvantages: depending on how your data is
> structured, your index can grow big. If you always search in only one
> language, each query uses only part of the index, which carries some
> performance penalty (how much depends on the size of the index).
> Another disadvantage is that the schema configuration can get a bit messy -
> since everything is in one index, for each field and field type you'll
> probably need to define different versions for different languages (each one
> with a different language specific analyzer), so for example, if you have a
> "title" field, you'll probably need to define "title_en" (for English
> content) and "title_zh" (for Chinese content), then you will also need to
> make sure that when you index the content, you send the right fields to Solr
> (although, you can perhaps create a clever update processor that updates the
> field names based on the language field).
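
Option 1 could be sketched on the client side roughly as follows. This is only an illustration under assumptions: the "language" field name, the "id" field, and both helper functions are hypothetical, and the renaming happens before the document is sent to Solr rather than in an update processor. The `q` and `fq` keys are the standard Solr query and filter-query parameters.

```python
def localize_fields(doc, language_field="language", shared=("id",)):
    # Rename every field except the language marker and the shared fields
    # to "<name>_<lang>", e.g. "title" -> "title_zh".
    lang = doc[language_field]
    out = {}
    for name, value in doc.items():
        if name == language_field or name in shared:
            out[name] = value
        else:
            out[f"{name}_{lang}"] = value
    return out


def language_query(terms, lang, field="title", language_field="language"):
    # Search the language-specific field and filter on the language field.
    return {
        "q": f"{field}_{lang}:{terms}",
        "fq": f"{language_field}:{lang}",
    }


doc = {"id": "42", "language": "zh", "title": "搜索引擎"}
print(localize_fields(doc))
print(language_query("solr", "en"))
```

The schema would still need a matching field (and analyzer) defined for each suffix, such as "title_en" and "title_zh".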
>
> Option 2: have a separate Solr core for each language. Advantages: Well, as
> opposed to Option 1, here you have smaller indexes, where each is dedicated
> to one language. If the corpus is very big you can have performance gains
> here. Since we are talking about different indexes here, each core has its
> own simple and clean schema (no need for multiple fields and field types).
> Disadvantage: The main one is that you cannot perform cross language search.
> You also need to remember to use the right Solr core when indexing &
> querying.
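
For Option 2, "remembering the right core" can be reduced to one lookup. A minimal sketch, assuming a multi-core setup where each core is reachable under its own path segment; the core names, the base URL, and the helper are all hypothetical.

```python
# Hypothetical mapping from language code to Solr core name.
CORES = {"en": "core_en", "zh": "core_zh"}


def core_url(lang, handler, base="http://localhost:8983/solr"):
    # Build the URL of a request handler (e.g. "select" or "update")
    # on the core that holds documents in the given language.
    if lang not in CORES:
        raise ValueError(f"no core configured for language {lang!r}")
    return f"{base}/{CORES[lang]}/{handler}"


print(core_url("zh", "select"))
print(core_url("en", "update"))
```

Failing loudly on an unknown language avoids silently indexing a document into the wrong core.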
>
>> 2) I posted some Chinese docs to the server. The query of my Chinese
>> word does not return any result. This happens to my Arabic docs too.
>> What filter should I look at for this type of problem? Thanks a lot!
>>
>
> Sorry, I don't have experience with Arabic or Chinese languages so I don't
> know of any good analyzers for them.
>
> Cheers,
> Uri
>>
>> Hi,
>>
>> I have two questions.
>>
>> 1) Can Solr be configured so all my English docs will be saved in a
>> group, say group-en? My Chinese docs will be saved in group-cn. So my
>> search will only be conducted in the intended group, instead of
>> everywhere.
>>
>> 2) I posted some Chinese docs to the server. The query of my Chinese
>> word does not return any result. This happens to my Arabic docs too.
>> What filter should I look at for this type of problem? Thanks a lot!
>>
>> Elaine
>>
>>
>
