lucene-solr-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: schema.xml for CJK, German, French, etc.
Date Thu, 03 Jul 2008 01:40:22 GMT

On Jul 2, 2008, at 9:16 PM, George Aroush wrote:
> Has anyone created a schema.xml for languages other than English?

Indeed.

> I'd like to see a working example, mainly for CJK, German and French.
> If you have any, can you share them?
>
> To get started, I created the following for German:
>
>  <fieldtype name="myfieldtype" class="solr.TextField">
>    <analyzer>
>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>      <filter class="solr.StopFilterFactory" ignoreCase="true"
>              words="stopwords.txt"/>
>      <filter class="solr.WordDelimiterFilterFactory"
>              generateWordParts="0" generateNumberParts="1"
>              catenateWords="1" catenateNumbers="1" catenateAll="0"/>
>      <filter class="solr.LowerCaseFilterFactory"/>
>      <filter class="solr.SnowballPorterFilterFactory" language="German"/>
>      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>    </analyzer>
>  </fieldtype>
>
> Will those filters work on German text?


One tip that will help: visit http://localhost:8983/solr/admin/analysis.jsp
and run some sample text through your field type to check that you're
getting the tokenization you want.  Solr's analysis introspection is
quite nice and easy to tinker with.

Removing stop words before lowercasing is fragile, though: StopFilter
is case-sensitive unless ignoreCase="true" is set (as you have it), and
stop word lists are generally lowercased.  Placing the
StopFilterFactory after the LowerCaseFilterFactory is the safer order;
other than that, the chain seems reasonable.
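
As a rough, untested sketch of that reordering, the chain might look
like the following.  It reuses the filters and attribute values from
your example and only moves the stop filter; it assumes stopwords.txt
holds lowercased German stop words, which is not something I can see
from here.

```xml
<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Split and catenate word/number parts before case is normalized. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Stop words are now matched against already-lowercased tokens.
         Assumption: stopwords.txt contains lowercased German stop words. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
```

Pasting a few German sentences into analysis.jsp against this field
type should show whether the stop words and stems come out as you
expect.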

As always, though, more concrete recommendations depend on what you
want to do with these languages.

	Erik

