lucene-solr-user mailing list archives

From Satish Kumar <satish.kumar.just.d...@gmail.com>
Subject Re: anti-words - exact match
Date Mon, 09 Aug 2010 21:11:50 GMT
Thanks Jon.

My initial thought was exactly like yours. My preference was to implement
this requirement entirely at the Solr level so that different applications
won't each have to implement this logic. However, I am not sure how to
shingle-ize the input query and use it in a filter query with a NOT operator
at the Solr layer. The other option, as you suggested, is to shingle-ize the
input query in the application layer -- this is doable, but means adding
logic in the application layer.
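
For reference, application-side shingling along the lines suggested could look roughly like the sketch below. It is only an illustration under assumptions: the field name `anti_words`, the normalization rules (lowercase, punctuation replaced by spaces, whitespace collapsed) and the `max_size` cap are all hypothetical, and the indexed anti-word values would need the same normalization for the exact match to work.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace (assumed rules)."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def shingles(query, max_size=None):
    """Return every contiguous word n-gram of the normalized query, shortest first."""
    words = normalize(query).split()
    limit = len(words) if max_size is None else min(max_size, len(words))
    return [
        " ".join(words[i:i + n])
        for n in range(1, limit + 1)
        for i in range(len(words) - n + 1)
    ]


def negated_filter(query, field="anti_words", max_size=3):
    """Build an fq value that excludes records whose field equals any shingle.

    `field` is a hypothetical field name; returns "" when the query has no words.
    """
    clauses = ['"%s"' % s for s in shingles(query, max_size)]
    if not clauses:
        return ""
    return "-%s:(%s)" % (field, " OR ".join(clauses))
```

With this, "I have swollen foot" produces a shingle "swollen foot", while "My foot is swollen" does not, which matches the exact-match requirement. The resulting string would be passed as the fq parameter of the request.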

For now I am settling on the below solution:

- each anti-word (which can consist of multiple words) will be stored as a
separate token. The input record will contain the different anti-words
separated by commas. solr.PatternTokenizerFactory will be used to split on
the comma and create the tokens

- the list of anti-words is stored in memory in the application layer, and
anti-words are extracted from the user-entered query (e.g. if the user enters
'I have swollen foot' and 'swollen foot' is an anti-word, 'swollen foot' is
extracted)

- a filter query with a NOT operator on the anti-word field is sent to Solr
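
The application-layer part of the steps above could be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the field name `anti_words`, the function names and the normalization rules are assumptions, and the same normalization would have to be applied to the values at index time.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace (assumed rules)."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def extract_anti_words(query, anti_words):
    """Return the anti-words that occur in the query as contiguous whole-word phrases."""
    # Padding with spaces makes the substring test match whole words only.
    padded = " " + normalize(query) + " "
    found = []
    for phrase in anti_words:
        norm = normalize(phrase)
        if norm and " " + norm + " " in padded:
            found.append(norm)
    return found


def build_fq(found, field="anti_words"):
    """Build the negated fq value, or None when no anti-word was found.

    `field` is a hypothetical field name.
    """
    if not found:
        return None
    return "-%s:(%s)" % (field, " OR ".join('"%s"' % p for p in found))


anti_word_list = ["swollen foot", "headache"]  # held in memory by the application
fq = build_fq(extract_anti_words("I have swollen foot", anti_word_list))
```

Here `fq` ends up as `-anti_words:("swollen foot")`, whereas the query "My foot is swollen" yields no anti-words and therefore no filter.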


Thanks much!

Satish

> This is tricky. You could try doing something with the ShingleFilter (
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory)
> at _query time_ to turn the user's query:
>
> "i have a swollen foot" into:
> "i", "i have", "i have a", "i have a swollen", .... "have", "have a", "have
> a swollen"... etc.
>
> I _think_ you can get the ShingleFilter factory to do that.
>
> But now you only want to exclude if one of those shingles matches the
> ENTIRE "anti-word". So maybe index as non-tokenized, so each of those
> shingles will somehow only match on the complete thing.  You'd want to
> normalize spacing and punctuation.
>
> But then you need to turn that into a _negated_ element of your query.
> Perhaps by using an fq with a NOT/"-" in it? And a query which 'matches'
> (causing 'not' behavior) if _any_ of the shingles match.
>
> I have no idea if it's actually possible to put these things together in
> that way. A non-tokenized field? Which still has its queries shingle-ized
> at query time? And then works as a negated query, matching for negation if
> any of the shingles match?  Not really sure how to put that together in your
> solrconfig.xml and/or application logic if needed. You could try.
>

yup, I didn't know how to shingle-ize the input query and use that as input
in a filter query.


> Another option would be doing the query-time 'shingling' in your app, and
> then it's a somewhat more normal Solr query. &fq= -"shingle one" -"shingle
> two" -"shingle three" etc.  Or put 'em in separate fq's depending on how you
> want to use your filter cache. Still searching on a non-tokenized field, and
> still normalizing on white-space and punctuation at both index time and
> (using same normalization logic but in your application logic this time)
> query time.  I think that might work.
>
> So I'm not really sure, but maybe that gives you some ideas.
>
> Jonathan
>
>
>
>
> Satish Kumar wrote:
>
>> Hi,
>>
>> We have a requirement to NOT display search results if the user query
>> contains terms that are in our anti-words field. For example, if the user
>> query is "I have swollen foot" and some records in our index have "swollen
>> foot" in the anti-words field, we don't want to display those records. How
>> do I go about implementing this?
>>
>> NOTE 1: the anti-words field can contain multiple values. Each value can be
>> one or more words (e.g. "swollen foot", "headache", etc.)
>>
>> NOTE 2: the match must be exact. If the anti-words field contains "swollen
>> foot" and the user query is "I have swollen foot", the record must be
>> excluded. If the user query is "My foot is swollen", the record should not
>> be excluded.
>>
>> Any pointers are greatly appreciated!
>>
>>
>> Thanks,
>> Satish
>>
>>
>>
>
