lucene-solr-user mailing list archives

From Daniel Brügge <daniel.brue...@googlemail.com>
Subject Re: SolrCloud, Zookeeper and Stopwords with Umlaute or other special characters
Date Thu, 08 Nov 2012 10:07:56 GMT
When I look at the text_de fieldType provided in the example schema, I can
see:

>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>                 words="lang/stopwords_de.txt" format="snowball"
>                 enablePositionIncrements="true"/>
>         <filter class="solr.GermanNormalizationFilterFactory"/>
>         <filter class="solr.GermanLightStemFilterFactory"/>


I tried this, and it removed the words with Umlaute. It seems that this is
because of format="snowball". I hadn't used that option because I thought
I had one word per line. But maybe some invisible characters got into my
stopword file and corrupted it.
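
One way to test the invisible-character theory is a short script like the
sketch below. It assumes the stopword file is meant to be UTF-8 with
precomposed (NFC) characters; the function name check_stopwords and the
list of suspicious characters are made up for illustration and are not
part of Solr.

```python
import unicodedata

# Characters that are easy to miss in an editor but change a token.
# This list is an illustrative assumption, not an exhaustive one.
SUSPICIOUS = {
    "\ufeff": "BOM / zero-width no-break space",
    "\u200b": "zero-width space",
    "\u00a0": "no-break space",
}


def check_stopwords(raw: bytes):
    """Return (line_number, problem) tuples for the raw bytes of a stopword file."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Line number 0 means the file as a whole could not be decoded.
        return [(0, f"not valid UTF-8: {exc}")]

    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        for char, label in SUSPICIOUS.items():
            if char in line:
                problems.append((number, label))
        # Control characters other than tab should not appear in a word list.
        if any(unicodedata.category(c) == "Cc" and c != "\t" for c in line):
            problems.append((number, "control character"))
        # "u" followed by a combining diaeresis looks like "ü" but is a different string.
        if line != unicodedata.normalize("NFC", line):
            problems.append((number, "decomposed (non-NFC) character"))
    return problems


# Small demo: a BOM on line 1 and a decomposed "ü" on line 3.
sample = "\ufeffaber\nfür\nu\u0308ber\n".encode("utf-8")
for number, problem in check_stopwords(sample):
    print(f"line {number}: {problem}")
```

To check a real file, pass its bytes, for example
check_stopwords(open("my_stopwords.txt", "rb").read()). An empty result
only means none of these particular problems were found.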

Thanks.

Daniel

On Thu, Nov 8, 2012 at 10:36 AM, Daniel Brügge <
daniel.bruegge@googlemail.com> wrote:

> Yes, I did this, and the words with Umlaute went through the
> StopFilter. The ones without Umlaute were correctly removed.
>
> On Thu, Nov 8, 2012 at 2:22 AM, Lance Norskog <goksron@gmail.com> wrote:
>
>> You can debug this with the 'Analysis' page in the Solr UI. You pick
>> 'text_general' and then give words with umlauts in the text box for
>> indexing and queries.
>>
>> Lance
>>
>> ----- Original Message -----
>> | From: "Daniel Brügge" <daniel.bruegge@googlemail.com>
>> | To: solr-user@lucene.apache.org
>> | Sent: Wednesday, November 7, 2012 8:45:45 AM
>> | Subject: SolrCloud, Zookeeper and Stopwords with Umlaute or other
>> | special characters
>> |
>> | Hi,
>> |
>> | I am running a SolrCloud cluster with version 4.0.0. I have a
>> | stopwords file which is in the correct encoding. It contains German
>> | Umlaute, e.g. 'ü'. I am also running a standalone Zookeeper which
>> | contains this stopwords file. In my schema I am using the stopwords
>> | file in the standard way:
>> |
>> | >
>> | >     <fieldType name="text_general" class="solr.TextField"
>> | > positionIncrementGap="100">
>> | >       <analyzer type="index">
>> | >                 <tokenizer class="solr.StandardTokenizerFactory"/>
>> | >                 <filter class="solr.StopFilterFactory"
>> | >                                 ignoreCase="true"
>> | >                                 words="my_stopwords.txt"
>> | >                                 enablePositionIncrements="true" />
>> |
>> |
>> | When indexing, I noticed that all stopwords without Umlaute are
>> | correctly removed, but the ones with Umlaute are still present.
>> |
>> | Is this a problem with ZK or Solr?
>> |
>> | Thanks & regards
>> |
>> | Daniel
>> |
>>
>
>
