lucene-solr-user mailing list archives

From "George Aroush" <geo...@aroush.net>
Subject RE: schema.xml for CJK, German, French, etc.
Date Thu, 03 Jul 2008 01:57:01 GMT
Thanks Erik!

Trouble is, I don't know those languages well enough to judge whether my
setup is correct, especially for CJK.

It's less problematic for European languages, but then again, should I be
using those English filters with the German SnowballPorterFilterFactory?
That is, will WordDelimiterFilterFactory work with a German filter?  Etc.

It would be nice if folks shared their settings (a generic one for each
language) so that we can add them to the Solr Wiki.

-- George

> -----Original Message-----
> From: Erik Hatcher [mailto:erik@ehatchersolutions.com] 
> Sent: Wednesday, July 02, 2008 9:40 PM
> To: solr-user@lucene.apache.org
> Subject: Re: schema.xml for CJK, German, French, etc.
> 
> 
> On Jul 2, 2008, at 9:16 PM, George Aroush wrote:
> > Has anyone created schema.xml for languages other than English?
> 
> Indeed.
> 
> > I'd like to see a working example, mainly for CJK, German and French.
> > If you have them, can you share them?
> >
> > To get me started, I created the following for German:
> >
> >  <fieldtype name="myfieldtype" class="solr.TextField">
> >    <analyzer>
> >      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >      <filter class="solr.StopFilterFactory" ignoreCase="true"
> >              words="stopwords.txt"/>
> >      <filter class="solr.WordDelimiterFilterFactory"
> >              generateWordParts="0" generateNumberParts="1"
> >              catenateWords="1" catenateNumbers="1" catenateAll="0"/>
> >      <filter class="solr.LowerCaseFilterFactory"/>
> >      <filter class="solr.SnowballPorterFilterFactory" language="German"/>
> >      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >    </analyzer>
> >  </fieldtype>
> >
> > Will those filters work on German text?
> 
> 
> One tip that will help: visit
> http://localhost:8983/solr/admin/analysis.jsp
> and test your field type on some sample text to see that you're
> getting the tokenization you want.  Solr's analysis
> introspection is quite nice and easy to tinker with.
> 
> Removing stop words before lowercasing won't quite work,
> though: StopFilter is case-sensitive, and stop words are
> generally lowercased.  Other than relocating the
> StopFilterFactory in that chain, it seems reasonable.
> 
> As always, though, more concrete recommendations depend on
> what you want to do with these languages.
> 
> 	Erik
> 
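For reference, the reordering Erik suggests could look roughly like the sketch below. It is the same German field type from the quoted message with the StopFilterFactory moved after the LowerCaseFilterFactory. Placing it before the stemmer is an assumption on my part, not something stated in the thread, and stopwords.txt would need to hold German stop words for this to be useful. Since ignoreCase="true" should already make the stop word match case-insensitive, the move probably matters most if that attribute is ever dropped.

```xml
<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Stop words are now removed after lowercasing; position before
         the stemmer is an assumption, so stems are not compared
         against the stop list -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
```

As Erik notes, the analysis page is the place to confirm that this chain produces the tokens you expect on sample German text.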

