lucene-solr-user mailing list archives

From "David Seltzer" <dselt...@TVEyes.com>
Subject RE: Stripping Punctuation in a fieldType
Date Fri, 15 Jan 2010 19:51:56 GMT
Does anyone out there know how to use PatternReplaceCharFilterFactory?

The closest thing to an example I can find is in the default schema.xml:
<!--
  The PatternReplaceFilter gives you the flexibility to use
  Java Regular expression to replace any sequence of characters
  matching a pattern with an arbitrary replacement string,
  which may include back references to portions of the original
  string matched by the pattern.

  See the Java Regular Expression documentation for more
  information on pattern and replacement string syntax.

  http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
-->
<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
        replacement="" replace="all"/>

I'm not sure how the PatternReplaceCharFilterFactory differs from the
PatternReplaceFilterFactory. Can anyone give me an example of how to
strip all commas, for instance, using this technique?

Thanks!

-Dave
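
For readers landing on this question: a char filter is declared with a <charFilter> element placed before the <tokenizer>, and it rewrites the raw character stream before tokenization, whereas a <filter> such as PatternReplaceFilterFactory rewrites each token after tokenization. A minimal sketch for stripping commas might look like the following. The fieldType name text_nocomma and the tokenizer/filter choices are illustrative assumptions, and the sketch assumes a Solr build that ships PatternReplaceCharFilterFactory with pattern and replacement attributes, so check the attribute names against your version.

```xml
<!-- Hypothetical fieldType; the name and the tokenizer choice are assumptions. -->
<fieldType name="text_nocomma" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run first, on the raw text: delete every comma. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="," replacement=""/>
    <!-- Then split on whitespace and lowercase; no stemming is applied. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, input such as "nations," should be indexed as "nations", because the comma is gone before the tokenizer ever sees the text.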

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Friday, January 15, 2010 2:32 PM
To: solr-user@lucene.apache.org
Subject: Re: Stripping Punctuation in a fieldType

Ah, ok, your approach makes sense. Mostly I was trying
to ensure that you weren't flying blind.

Perhaps you would find some joy with
PatternReplaceCharFilterFactory, replacing
all non-alphanum with empty string?

HTH
Erick
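
Erick's suggestion of replacing every non-alphanumeric character could be sketched as below. This is an illustration under assumptions, not a tested config: the pattern is one possible choice that keeps Unicode letters, combining marks, numbers, and whitespace, and deletes everything else. Whitespace has to survive so that the tokenizer can still split words.

```xml
<analyzer>
  <!-- Delete anything that is not a letter (\p{L}), combining mark (\p{M}),
       number (\p{N}), or whitespace (\s). The pattern is an assumption. -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="[^\p{L}\p{M}\p{N}\s]" replacement=""/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

Because the apostrophe is deleted rather than replaced with a space, "nation's", "nations" and "nations," should all reduce to "nations", which is the behavior asked for earlier in the thread.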

On Fri, Jan 15, 2010 at 2:07 PM, David Seltzer <dseltzer@tveyes.com> wrote:

> Hi Erick,
>
> Thanks for your thoughtful reply!
>
> > It's actually quite rare for simple tokenizers like these to be
> > satisfactory
> > unless it's a field you can guarantee is indexed/searched in a very
> > controlled manner, say part numbers or words from a list. In your
> > example above, none of the three variants would get a hit if the
> > user searched for "nation". Is that what you want?
>
> Yes, this is what I want. The reason for this behavior is that the
> output of SOLR needs to closely match the search results provided by a
> different legacy system. Our users have rigidly defined queries. A user
> who was interested in "nation's" is required either to search for
> "nations" or "nation*".
>
> > But no, Standard* don't have any stemming built in. And
> > what do you mean by "language specific functionality"?
> > They do NOT fold accents for instance if that's what
> > you're getting at.
>
> I asked that because I'm not super comfortable I know what's going on
> under the hood inside these tokenizers. Do they work the same on
> RightToLeft languages (such as Arabic) as they do in LeftToRight
> languages? (My assumption regarding the WhitespaceTokenizer is that it
> would be very language/direction neutral)
>
> > Could you explain a bit about *why* you want this behavior?
>
> In short, we have to support multiple languages and match the behavior of
> an existing non-Solr system.
>
> -Dave
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Friday, January 15, 2010 1:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Stripping Punctuation in a fieldType
>
> If you haven't seen it, this page is invaluable for this kind of
> question:
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LetterTokenizerFactory
>
> LetterTokenizerFactory may well be your friend here, followed by
> LowerCaseFilterFactory. One problem is that it would
> split "nation's" up into "nation" and "s", so searching on "nations"
> wouldn't get a hit.
>
> But you have equally ugly stuff with WhitespaceTokenizerFactory
> as you're finding out.
>
> It's actually quite rare for simple tokenizers like these to be
> satisfactory
> unless it's a field you can guarantee is indexed/searched in a very
> controlled manner, say part numbers or words from a list. In your
> example above, none of the three variants would get a hit if the
> user searched for "nation". Is that what you want?
>
> But no, Standard* don't have any stemming built in. And
> what do you mean by "language specific functionality"?
> They do NOT fold accents for instance if that's what
> you're getting at.
>
> Could you explain a bit about *why* you want this behavior?
>
> HTH
> Erick
>
> On Fri, Jan 15, 2010 at 1:17 PM, David Seltzer <dseltzer@tveyes.com>
> wrote:
>
> > I'm hesitant to change Tokenizers at the moment because what we have is
> > working so nicely - or so I thought.
> >
> > What I'm looking for is case-insensitive search for words and numbers
> > without any of the stemming features turned on. The new requirement is
> > that we take punctuation out of the mix.
> >
> > Right now when I search for "Obama" I'm not getting any hits on "Obama."
> >
> > So I'm basically looking to strip punctuation. The consequence would be
> > that "nation's", "nations" and "nations," would all be represented the
> > same way.
> >
> > Would the StandardTokenizerFactory accomplish this?
> > Does it have any language specific functionality?
> > Does it do anything with stemming?
> >
> > Thanks for everyone's input!
> >
> > -Dave
> >
> >
> >
> > -----Original Message-----
> > From: Ahmet Arslan [mailto:iorixxx@yahoo.com]
> > Sent: Friday, January 15, 2010 12:42 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Stripping Punctuation in a fieldType
> >
> > > I'm trying to find the best way to set up a fieldType that
> > > strips punctuation.
> >
> > Use solr.StandardTokenizerFactory, which strips punctuation.
> >
> > Or if you do not care about alphanumeric or numeric queries use
> > solr.LowerCaseTokenizerFactory that uses LetterTokenizer.
> >
> > > I think the right way to do this is using a CharacterFilter
> > > of some type, but I can't seem to find any examples of how
> > > to set this up in a schema.xml file.
> >
> > If you want to use solr.MappingCharFilterFactory you need to write all
> > punctuation characters to a text file manually, e.g. "," => ""
> >
> >
> >
> >
>
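
The MappingCharFilterFactory approach Ahmet mentions in the quoted thread could be set up roughly as follows. This is a sketch under assumptions: the file name mapping-punctuation.txt is hypothetical, and the entries shown in the comment are only a small sample in the "source" => "target" format quoted above.

```xml
<analyzer>
  <!-- mapping-punctuation.txt is a hypothetical file in the conf directory.
       Each line maps one source string to its replacement, for example:
         "," => ""
         "." => ""
         "'" => ""
  -->
  <charFilter class="solr.MappingCharFilterFactory"
              mapping="mapping-punctuation.txt"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

The trade-off is that every punctuation character has to be listed by hand, whereas a regex-based char filter can cover whole character classes with one pattern.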
