lucene-solr-user mailing list archives

From Bertrand Mathieu <bmath...@universcine.com>
Subject Alphabetical index for faceting
Date Wed, 03 Jun 2009 10:48:28 GMT
Hello,

My goal is to build an index for alphabetical faceting of titles. To do this, I'm
trying to define a fieldType that indexes the first letter of the text, with
stopwords removed. My problem is that without WordDelimiterFilterFactory the
stopwords are not removed (KeywordTokenizerFactory emits the whole title as a
single token, so StopFilterFactory never sees the individual words), and with it
I end up with more than one token (and I'd like to keep just the first one).

For example, the string "The Curse of Monkey Island" should be indexed as
"c".

Here is my field type definition as of now:

    <fieldType name="alphabetical" class="solr.TextField"
               sortMissingLast="true" omitNorms="true">

      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory" />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords_fr.txt"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([0-9a-z]).*" replacement="$1" replace="all" />
      </analyzer>

    </fieldType>

With my example, this produces 3 tokens: "c", "m", "i".

I have not been able to find any documentation related to what I want to do
(wrong keywords in Google?). At this point I'm beginning to think that I
will have to write a custom filter to replace the
PatternReplaceFilterFactory: it would keep the first character of the first
token and discard everything else. Unfortunately, I have not programmed in
Java for years, so I would like to avoid that solution if possible.
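
To make the behaviour I am after concrete, here is a rough sketch of the logic in plain Java. It is not a Solr or Lucene filter, only an illustration: the class name, the method name, and the hard-coded stopword list are made up for this example (a real filter would read stopwords_fr.txt), and accent folding is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class FirstLetterSketch {
    // Hypothetical stopword list, for illustration only;
    // a real filter would load stopwords_fr.txt instead.
    private static final Set<String> STOPWORDS = new HashSet<>(
            Arrays.asList("the", "of", "a", "an", "le", "la", "les", "un", "une", "des"));

    /**
     * Returns the first character in [0-9a-z] of the first word that is not
     * a stopword, or an empty string if no such character exists.
     * Accented characters are NOT folded here (no ISOLatin1AccentFilter equivalent).
     */
    public static String firstLetter(String title) {
        // Lowercase, then split on whitespace to mimic word-level tokens.
        String[] words = title.trim().toLowerCase(Locale.ROOT).split("\\s+");
        for (String word : words) {
            if (word.isEmpty() || STOPWORDS.contains(word)) {
                continue; // skip stopwords such as "the" and "of"
            }
            for (int i = 0; i < word.length(); i++) {
                char c = word.charAt(i);
                if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')) {
                    return String.valueOf(c); // keep one character, discard the rest
                }
            }
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(firstLetter("The Curse of Monkey Island"));
    }
}
```

With the example title, "The" is skipped as a stopword and the first character of "Curse" is kept, so all later words are ignored.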

And since I don't see my need as particularly uncommon, I am wondering what
I am missing. Any ideas?

-- 
Bertrand Mathieu
