lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Question about solr.WordDelimiterFilterFactory
Date Thu, 12 Apr 2012 12:01:27 GMT
WordDelimiterFilterFactory will _almost_ do what you want
by setting things like catenateWords=0 and catenateNumbers=1,
_except_ that the punctuation will be removed. So
12.34 -> 1234
ab,cd -> ab cd
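For reference, a minimal fieldType sketch along the lines Erick describes (the field type name and the surrounding analyzer choices here are illustrative, not from the thread) might look like:

```xml
<fieldType name="text_num" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- keep word parts separate, but glue number parts back together -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this, "12.34" indexes as "1234" (punctuation stripped, digits catenated) and "ab,cd" as "ab" and "cd" — close to, but not exactly, the behavior asked for.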

is that "close enough"?

Otherwise, writing a simple Filter is probably the way to go.
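As a rough sketch of the rule such a custom filter or tokenizer would have to implement — treating "." and "," as part of a token only when joining digits — a plain regex can express it. The class and method names here are made up for illustration; this is not Solr API code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumberAwareTokenizerSketch {
    // A "number" token is a digit run, optionally joined by "." or ","
    // to further digit runs (12.34, 12,345). Anything else falls back
    // to a plain run of word characters, so punctuation between
    // letters still splits tokens (ab,cd -> ab, cd).
    private static final Pattern TOKEN =
        Pattern.compile("\\d+(?:[.,]\\d+)*|\\w+");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("12.34"));   // [12.34]
        System.out.println(tokenize("12,345"));  // [12,345]
        System.out.println(tokenize("ab,cd"));   // [ab, cd]
    }
}
```

A pattern like this could probably also be handed to solr.PatternTokenizerFactory with group="0" instead of writing a filter from scratch, though that swaps out the whole tokenizer rather than adding a filter.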

Best
Erick

On Wed, Apr 11, 2012 at 1:59 PM, Jian Xu <josephjxu@yahoo.com> wrote:
> Hello,
>
> I am new to Solr/Lucene. I am tasked with indexing a large number of documents. Some of these
> documents contain decimal points. I am looking for a way to index these documents so that
> adjacent numeric characters (such as [0-9.,]) are treated as a single token. For example,
>
> 12.34 => "12.34"
> 12,345 => "12,345"
>
> However, "," and "." should be treated as usual when around non-digit characters. For
> example,
>
> ab,cd => "ab" "cd".
>
> This is so that searching for "12.34" will match "12.34", not "12 34". Searching for "ab.cd"
> should match both "ab.cd" and "ab cd".
>
> After doing some research on Solr, it seems that there is a built-in filter called
> solr.WordDelimiterFilter that supports a "types" attribute which maps special characters to
> different delimiter types. However, it isn't exactly what I want: it doesn't provide context
> checks, such as requiring that "," or "." be surrounded by digit characters.
>
> Does anyone have experience configuring Solr to meet these requirements? Is writing
> my own plugin necessary for something this simple?
>
> Thanks in advance!
>
> -Jian
