lucene-solr-user mailing list archives

From TK <kuros...@sonic.net>
Subject Re: Implementing custom analyzer for multi-language stemming
Date Tue, 05 Aug 2014 04:10:43 GMT
On 7/30/14, 10:47 AM, Eugene wrote:
>      Hello, fellow Solr and Lucene users and developers!
>
>      In our project we receive text from users in different languages. We
> detect language automatically and use Google Translate APIs a lot (so
> having an arbitrary number of languages in our system doesn't concern us).
> However, we need to be able to search using stemming. Having nearly a
> hundred fields (several fields for each language with language-specific
> stemmers) listed in our search query is not an option. So we need a way to
> have a single index which has stemmed tokens for different languages.

Do you mean a Tokenizer that switches among the supported languages
depending on the value of a "lang" field? I thought about this when I
started working on Solr/Lucene, and soon realized it is not possible
because of the way Lucene is designed: a Tokenizer in an analyzer chain
cannot peek at another field's value, and there is no way to control
which field is processed first.

If that's not what you are trying to achieve, could you tell us what
it is? If you have text in different languages in a single field, and
someone searches for a word common to many languages, such as "sports"
(or "Lucene" for that matter), Solr will return documents in many
different languages, most of which the user doesn't understand. Would
that be useful? If you have a special use case, would you like to
share it?

-- 
Kuro
