lucene-solr-user mailing list archives

From Christian Zambrano <czamb...@gmail.com>
Subject Re: Question about PatternReplace filter and automatic Synonym generation
Date Tue, 06 Oct 2009 03:59:24 GMT
Prasanna,

Wouldn't it be better to use the built-in token filters at both index and
query time to convert 'it!' to just 'it'? I believe
WordDelimiterFilterFactory will do that for you.
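
As a rough illustration of that idea (plain Python, not Solr code; the names `normalize` and `analyze` are made up for this sketch, and the pattern assumes at least one letter before the '!'), running the same normalization at index time and at query time makes 'it!' and 'it' meet on the same term without any synonym entry:

```python
import re

# Hypothetical pattern: a letters-only token followed by one or more '!'.
_TRAILING_BANG = re.compile(r"^([A-Za-z]+)!+$")


def normalize(token):
    """Return the token without its trailing '!', or unchanged if it does not match."""
    m = _TRAILING_BANG.match(token)
    return m.group(1) if m else token


def analyze(text):
    """Whitespace-tokenize, lowercase, and normalize each token.

    The same chain is meant to be applied to both documents and queries.
    """
    return [normalize(t) for t in text.lower().split()]


# Index side: the title token 'It!' is stored as 'it'.
indexed_terms = analyze("It! came from outer space")

# Query side: both spellings reduce to the indexed term.
matches_plain = analyze("it")[0] in indexed_terms
matches_bang = analyze("it!")[0] in indexed_terms
```

One trade-off of this approach: a title containing plain 'it' would also match the query 'it!', which the one-way synonym rule does not do.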

Christian

On Oct 5, 2009, at 7:31 PM, Prasanna Ranganathan <pranganathan@netflix.com> wrote:

>
> On 10/5/09 2:46 AM, "Shalin Shekhar Mangar" <shalinmangar@gmail.com> wrote:
>
>>> Alternatively, is there a filter available which takes in a pattern and
>>> produces additional forms of the token depending on the pattern? The use
>>> case I am looking at here is using such a filter to automate synonym
>>> generation. In our application, quite a few of the synonym file entries
>>> match a specific pattern, and I believe having such a filter would make
>>> them easier to manage. Please correct me if I am missing some unwanted
>>> side effect of this approach.
>>>
>>>
>> I do not understand this. TokenFilters are used for things like stemming,
>> replacing patterns, lowercasing, n-gramming, etc. The synonym filter inserts
>> additional tokens (synonyms) from a file for each token.
>>
>> What exactly are you trying to do with synonyms? I guess you could do
>> stemming etc. with synonyms, but why do you want to do that?
>
> I'll try to explain with an example. Given the term 'it!' in the title, it
> should match both 'it' and 'it!' in the query as an exact match. Currently,
> this is done by using a synonym entry (and an index-time SynonymFilter) as
> follows:
>
> it! => it, it!
>
> Now, the above holds true for all cases where you have a title token of the
> form [a-zA-Z]*!. Handling all of those cases requires adding synonyms
> manually for each one, which is not easy to manage and does not scale.
>
> I am hoping to do the same by using an index-time filter that takes in a
> pattern, like the PatternReplace filter, but adds the newly created token
> instead of replacing the original one. Does this make sense? Am I missing
> something that would break this approach?
>
>>
>> Note that a change to the synonym file requires re-indexing the affected
>> documents. Also, the synonym map is kept in memory.
>
> What is the overhead incurred in having an additional filter applied during
> indexing? Is it strictly CPU only?
>
> Thanks a lot for your valuable input.
>
> Regards,
>
> Prasanna.
>
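
For comparison, the index-time filter described in the quoted message (keep the original token and add a pattern-derived form next to it) could be sketched as below. This is plain Python rather than a Lucene TokenFilter; `add_pattern_variants` and `PATTERN` are hypothetical names, and the pattern assumes at least one letter before the '!'. The added form gets a position increment of 0, which is how injected synonyms are stacked on the same position in Lucene.

```python
import re

# Hypothetical pattern: letters followed by a single trailing '!'.
PATTERN = re.compile(r"^([A-Za-z]+)!$")


def add_pattern_variants(tokens):
    """Yield (term, position_increment) pairs.

    Every original token is kept (increment 1). When a token matches
    PATTERN, the stripped form is emitted as well at the same position
    (increment 0), instead of replacing the original.
    """
    for tok in tokens:
        yield (tok, 1)
        m = PATTERN.match(tok)
        if m:
            yield (m.group(1), 0)


stream = list(add_pattern_variants(["it!", "rocks"]))
# stream now holds 'it!' and 'it' at the same position, followed by 'rocks'.
```

This reproduces the effect of the `it! => it, it!` rule for every matching token without listing each one in the synonym file.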
