lucene-solr-user mailing list archives

From "Wagner,Harry" <wagn...@oclc.org>
Subject RE: Solr and KStem
Date Tue, 11 Sep 2007 12:52:33 GMT
Bill,
Currently it is a plug-in.  Put the lower case filter ahead of kstem,
just as for porter (example below).  You can use it with porter, but I
can't imagine why you would want to.  At least not in the same analyzer.
Hope this helps.

<fieldtype name="text_kstem" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="org.oclc.solr.analysis.KStemFilterFactory" cacheSize="20000"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="org.oclc.solr.analysis.KStemFilterFactory" cacheSize="20000"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
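
To index into this type, a field just needs to reference it by name in
the fields section of schema.xml.  A minimal sketch, assuming a field
called "title" (a placeholder for illustration, not something the KStem
plug-in defines):

```xml
<!-- "title" is a hypothetical field name; type must match the
     fieldtype name declared above (text_kstem) -->
<field name="title" type="text_kstem" indexed="true" stored="true"/>
```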

Cheers... harry

-----Original Message-----
From: Bill Fowler [mailto:wwfowler@gmail.com] 
Sent: Monday, September 10, 2007 8:33 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr and KStem

Hello,

I would like to test this and have a few questions (please excuse what
may seem naive questions).

I would like to verify that this is purely a configuration feature --
since the schema.xml defines the analysis/tokenizer chain, no other
changes are required.  Also, the source seems to say that a lower case
factory needs to be "farther down" the tokenizer chain.  So does this
mean that the KStem factory appears before the lower case filter factory
in the schema.xml?  Is there a recommended (required?) tokenizer
factory?  I am using the WhitespaceTokenizerFactory, which seems OK.
Finally, I take it that I need to remove the EnglishPorterFilterFactory
item in the schema.xml -- or no?

Thanks,

Bill



On 9/10/07, Wagner,Harry <wagnerh@oclc.org> wrote:
>
> Hi Yonik,
> The modified KStemmer source is attached. The original KStemFilter is
> now wrapped (and replaced) by KStemFilterFactory.  I also changed the
> path to avoid any naming collisions with existing Lucene code.
>
> I included the jar file also, for anyone who wants to just drop and
> play:
>
> - put KStem2.jar in your solr/lib directory.
> - change your schema to use: <filter
> class="org.oclc.solr.analysis.KStemFilterFactory" cacheSize="20000"/>
> - restart your app server
>
> I don't know if you credit contributions, but if so please include
> OCLC.
> Seems only fair since I did this on their dime :)
>
> Cheers!
> harry
>
>
> -----Original Message-----
> From: yseeley@gmail.com [mailto:yseeley@gmail.com] On Behalf Of Yonik
> Seeley
> Sent: Friday, September 07, 2007 3:59 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr and KStem
>
> On 9/7/07, Wagner,Harry <wagnerh@oclc.org> wrote:
> > I've implemented a Solr plug-in that wraps KStem for Solr use. KStem is
> > considered to be more appropriate for library usage since it is much
> > less aggressive than Porter (i.e., searches for organization do NOT
> > match on organ!). If there is any interest in feeding this back into
> > Solr I would be happy to contribute it.
>
> Absolutely.
> We need to make sure that the license for that k-stemmer is ASL
> compatible of course.
>
> -Yonik
>
>
