lucene-solr-user mailing list archives

From Prasanna Ranganathan <>
Subject Re: Question about PatternReplace filter and automatic Synonym generation
Date Tue, 06 Oct 2009 00:31:26 GMT

On 10/5/09 2:46 AM, "Shalin Shekhar Mangar" <> wrote:

>> Alternatively, is there a filter available which takes in a pattern and
>> produces additional forms of the token depending on the pattern? The use
>> case I am looking at here is using such a filter to automate synonym
>> generation. In our application, quite a few of the synonym file entries
>> match a specific pattern and having such a filter would make it easier I
>> believe. Pl. do correct me in case I am missing some unwanted side-effect
>> with this approach.
> I do not understand this. TokenFilters are used for things like stemming,
> replacing patterns, lowercasing, n-gramming etc. The synonym filter inserts
> additional tokens (synonyms) from a file for each token.
> What exactly are you trying to do with synonyms? I guess you could do
> stemming etc with synonyms but why do you want to do that?
 I'll try to explain with an example. Given the term 'it!' in the title, the
title should be an exact match for both 'it' and 'it!' in the query. Currently,
this is done with a synonym entry (and an index-time SynonymFilter):

 it! => it, it!

 Now, the above holds true for every title token of the form [a-zA-Z]+!,
that is, one or more letters followed by '!'. Handling all of those cases
means adding a synonym manually for each one, which is hard to manage and
does not scale.
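
 For illustration, the manual step could be scripted instead. The following is
a minimal sketch in plain Python, assuming the title terms are already
available as a list; the names BANG_TERM and synonym_lines are hypothetical,
and the pattern is the letters-then-'!' form described above.

```python
import re

# Assumed pattern: one or more ASCII letters followed by a single '!'.
BANG_TERM = re.compile(r"[A-Za-z]+!")


def synonym_lines(terms):
    """Return one 'term => stripped, term' line per distinct matching term."""
    lines = []
    seen = set()
    for term in terms:
        # fullmatch ensures the whole term has the letters-then-'!' shape.
        if BANG_TERM.fullmatch(term) and term not in seen:
            seen.add(term)
            stripped = term[:-1]  # drop the trailing '!'
            lines.append(f"{term} => {stripped}, {term}")
    return lines


if __name__ == "__main__":
    for line in synonym_lines(["it!", "yahoo!", "plain", "it!"]):
        print(line)
```

 This only automates writing the file. The entries still have to be kept in
sync with the indexed titles, which is the part that does not scale.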

 I am hoping to do the same with an index-time filter that takes a pattern,
like the PatternReplace filter, but adds the newly created token alongside
the original instead of replacing it. Does this make sense? Am I missing
something that would break this approach?
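
 The intended behaviour can be sketched as follows. This is plain Python, not
the Lucene TokenFilter API, and the names BANG_TERM and expand_tokens are
hypothetical. Each token is modelled as a (term, position increment) pair, and
the extra form is emitted with an increment of 0 so that it sits at the same
position as the original, the way an injected synonym does.

```python
import re

# Assumed pattern: one or more ASCII letters (captured) followed by '!'.
BANG_TERM = re.compile(r"([A-Za-z]+)!")


def expand_tokens(terms):
    """Return (term, position_increment) pairs, keeping every original token.

    A term matching BANG_TERM also yields its stripped form with a position
    increment of 0, i.e. stacked on the same position as the original.
    """
    out = []
    for term in terms:
        out.append((term, 1))  # the original token always survives
        match = BANG_TERM.fullmatch(term)
        if match:
            out.append((match.group(1), 0))  # added form, same position
    return out


if __name__ == "__main__":
    print(expand_tokens(["got", "it!", "now"]))
```

 With this shape, both 'it' and 'it!' are indexed for the title without any
per-term synonym entry.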

> Note that a change in synonym file needs a re-index of the affected
> documents. Also, the synonym map is kept in memory.

 What is the overhead of applying an additional filter during indexing? Is it
strictly CPU only?

 Thanks a lot for your valuable input.


