lucene-solr-user mailing list archives

From Prasanna Ranganathan <pranganat...@netflix.com>
Subject Re: Question about PatternReplace filter and automatic Synonym generation
Date Wed, 07 Oct 2009 18:25:18 GMT


On 10/6/09 3:32 PM, "Chris Hostetter" <hossman_lucene@fucit.org> wrote:

> 
> :  I'll try to explain with an example. Given the term 'it!' in the title, it
> : should match both 'it' and 'it!' in the query as an exact match. Currently,
> : this is done by using a synonym entry  (and index time SynonymFilter) as
> : follows:
> : 
> :  it! => it, it!
> : 
> :  Now, the above holds true for all cases where you have a title token of the
> : form [a-zA-Z]*!. Handling all of those cases requires adding synonyms
> : manually for each case, which is not easy to manage and does not scale.
> : 
> :  I am hoping to do the same by using an index-time filter that takes in a
> : pattern like the PatternReplace filter and adds the newly created token
> : instead of replacing the original one. Does this make sense? Am I missing
> : something that would break this approach?
> 
> something like this would be fairly easy to implement in Lucene, but
> somewhat confusing to try and configure in Solr.  I was going to suggest
> that you use something like...
>  <filter class="solr.PatternReplaceFilterFactory"
>                 pattern="^(.*)(!)$" replacement="$1 $1$2" replace="all" />
> 
> ..and then have a subsequent filter that splits the tokens on the
> whitespace (or any other special character you could use in the
> replacement) ... but apparently we don't have any built-in filters that
> will just split tokens on a character/pattern for you.  that would also be
> fairly easy to write if someone wants to submit a patch.

There is a solr.PatternTokenizerFactory class which likely fits the bill in
this case. The related question I have is this: is it possible to have
multiple Tokenizers in your analysis chain?
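
To make the intended index-time behaviour concrete, here is a minimal sketch in
plain Java, independent of the Lucene and Solr APIs. The class name
PatternSynonymSketch, the method expand, and the pattern ^([a-zA-Z]+)!$ are
assumptions made up for illustration; they are not existing Solr classes or
settings. For every token that matches, the stripped form is emitted next to
the original instead of replacing it:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternSynonymSketch {
    // Hypothetical pattern: one or more ASCII letters followed by a single '!'.
    private static final Pattern TRAILING_BANG = Pattern.compile("^([a-zA-Z]+)!$");

    /**
     * Returns the input tokens in order. For each token that matches the
     * pattern, the captured stem is added just before the original token,
     * mirroring the synonym entry "it! => it, it!".
     */
    public static List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            Matcher m = TRAILING_BANG.matcher(token);
            if (m.matches()) {
                out.add(m.group(1)); // generated variant, e.g. "it"
            }
            out.add(token);          // the original token is always kept
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [do, it, it!, now]
        System.out.println(expand(Arrays.asList("do", "it!", "now")));
    }
}
```

In a real Lucene TokenFilter the generated variant would presumably be emitted
with a position increment of 0, so that it stacks on the original token the way
a synonym does; the flat list above does not model positions.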

Prasanna.

