lucene-solr-user mailing list archives

From Daniel Alheiros <>
Subject Re: Problem with Russian stemmer in Solr 1.2
Date Tue, 10 Jul 2007 10:27:20 GMT
Hi Andrew

Yes, I saw that. As I don't know Russian, I had to assume it was adequate.
But since you have much more to add to it, it would be great if you could
contribute that.

The problem is that the Russian analyzer and its filters are all final
classes, which rules out an elegant extension. But you can create an analyzer
that reuses the parts you are interested in (in this case, the stemmer) and
customizes the other filters. I would suggest doing that by creating Solr
factories, so you can point to your own files containing your stopwords. Any
chance you could contribute this stopword list?

One of my reasons for not using RussianAnalyzer directly was that I need a
WhitespaceTokenizer that removes HTML code... So I created my own factories.
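
For illustration, a field type assembled from separate factories might look
roughly like the sketch below. It is untested and rests on assumptions: the
field type name, the stopwords file name (stopwords_ru.txt) and the stem
filter factory class (com.example.solr.RussianStemFilterFactory) are
hypothetical placeholders for whatever you create yourself. HTML stripping is
not shown.

```xml
<!-- Sketch only. The stem filter factory class and the stopwords file name
     are hypothetical placeholders, not classes or files shipped with Solr. -->
<fieldtype name="text_russian_custom" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Lowercase tokens before stopword removal and stemming -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Stopwords come from an editable file instead of a hardcoded list -->
    <filter class="solr.StopFilterFactory" words="stopwords_ru.txt"
            ignoreCase="true"/>
    <!-- Hypothetical custom factory wrapping the Russian stemmer -->
    <filter class="com.example.solr.RussianStemFilterFactory"/>
  </analyzer>
</fieldtype>
```

With a chain like this, the stopword list lives in a plain file in the Solr
conf directory, and a synonym filter could be added as one more filter line.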


On 9/7/07 19:36, "Andrew Stromnov" <> wrote:

> Hi, Daniel
> The stemmer in RussianAnalyzer works as expected. But this analyzer doesn't
> allow any Solr customization: all stopwords are hardcoded, and there is no
> support for a custom tokenizer or for synonyms.
> RussianAnalyzer is similar to this scheme:
>   standard tokenizer
>   standard filter factory
>   word delimiter filter factory
>   lowercase filter factory
>   stop filter factory (with hardcoded stopwords)
>   russian stem filter
> Regards,
> Andrew
> Daniel Alheiros wrote:
>> Hi Andrew
>> In fact I did it by creating all the factories for Solr, but I think you
>> can use the analyzer directly by changing your schema like this:
>> <fieldtype name="cpstext_russian" class="solr.TextField"
>> positionIncrementGap="100">
>>         <analyzer type="index"
>> class="">
>>         </analyzer>
>>         <analyzer type="query"
>> class="">
>>         </analyzer>
>> </fieldtype>
>> I've not tested that, but I saw something like this.
>> Please tell me if it works as expected and if it solves your problem (I'm
>> indexing Russian content, and as you seem to be knowledgeable about the
>> Russian language, your comments are very useful).
>> Regards,
>> Daniel