lucene-solr-user mailing list archives

From Dmitry Kan <dmitry....@gmail.com>
Subject searching camel cased terms with phrase queries
Date Wed, 07 Nov 2012 09:58:10 GMT
Hello list,

There have apparently been a number of threads about handling camel-cased
words in the past (
http://search-lucene.com/?q=camel+case&fc_project=Lucene&fc_project=Solr).
Our case is somewhat different from them.

===================
Configuration & example
===================

To illustrate the issue, let me give you a real example from our data.
Suppose there is a term in the original text: SmartTV.

If a user types "SmartTV" or "smart tv", we want both to hit the original
term SmartTV. To achieve this, the following filter is used in our
Solr 3.4 schema:

index side:

              <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="0"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="0"
                preserveOriginal="1"
                spiltOnCaseChange="1"
              />

query side:

              <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="0"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="0"
                preserveOriginal="1"
                spiltOnCaseChange="1"
              />

(identical to the index side)

Copying from the analysis page, the index will contain the following terms
and their positions:

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
catenateNumbers=0}

position   term text   startOffset   endOffset   type
1          SmartTV     0             7           <ALPHANUM>
1          Smart       0             5           <ALPHANUM>
2          TV          5             7           <ALPHANUM>

(StandardTokenizerFactory and StandardFilterFactory precede this filter,
but since they do not change anything in this case, their output is
omitted.)

On the query side, the query "smart tv" is processed as follows:

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
catenateNumbers=0}

position   term text   startOffset   endOffset   type
1          smart       0             5           <ALPHANUM>
2          tv          6             8           <ALPHANUM>

So there is a match (LowerCaseFilterFactory is of course configured to
follow the WordDelimiterFilterFactory to unify case for matching), and the
user can happily run the queries 'smart tv', 'smarttv' and 'SmartTV'.

===================================================
More complex example that doesn't work with the above configuration
===================================================

Problems start when a user types "smarttv for me" against the text
"SmartTV for me". Here are the index and query analysis excerpts:

index:

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
catenateNumbers=0}

position   term text   startOffset   endOffset   type
1          SmartTV     0             7           <ALPHANUM>
1          Smart       0             5           <ALPHANUM>
2          TV          5             7           <ALPHANUM>
3          for         8             11          <ALPHANUM>
4          me          12            14          <ALPHANUM>

query:

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
catenateNumbers=0}

position   term text   startOffset   endOffset   type
1          smarttv     0             7           <ALPHANUM>
2          for         8             11          <ALPHANUM>
3          me          12            14          <ALPHANUM>

Since smarttv was written in lower case in the user query, no split on case
change is triggered, and we believe there is no match because the term
positions differ: 'for' is at position 3 in the index but at position 2 in
the query, so 'smarttv' and 'for' are not adjacent in the index, as the
phrase query requires.


=========================
Config change to fix the problem
=========================


Here catenateWords=1 on the indexing side comes to the rescue. It changes
things to:

index:

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
spiltOnCaseChange=1, generateNumberParts=0, catenateWords=1,
luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
catenateNumbers=0}

position   term text   startOffset   endOffset   type
1          SmartTV     0             7           <ALPHANUM>
1          Smart       0             5           <ALPHANUM>
2          TV          5             7           <ALPHANUM>
2          SmartTV     0             7           <ALPHANUM>
3          for         8             11          <ALPHANUM>
4          me          12            14          <ALPHANUM>

query (copied again for comparison):

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
catenateNumbers=0}

position   term text   startOffset   endOffset   type
1          smarttv     0             7           <ALPHANUM>
2          for         8             11          <ALPHANUM>
3          me          12            14          <ALPHANUM>

Now there should be a match, because the terms 'smarttv', 'for' and 'me'
are adjacent in the index (ignoring the case differences, as
LowerCaseFilterFactory unifies them for us), and that is what is required
by the phrase query "smarttv for me".
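
To make the position reasoning concrete, here is a minimal Python sketch of
phrase matching over term positions. It is only a simplified model and not
Lucene's implementation: the function name phrase_matches is made up for
this illustration, and the position maps are transcribed by hand from the
analysis excerpts above, with terms lowercased as LowerCaseFilterFactory
would do.

```python
from typing import Dict, Set

# position -> set of (lowercased) terms stacked at that position
Stream = Dict[int, Set[str]]


def phrase_matches(index: Stream, query: Stream) -> bool:
    """Return True if the query positions line up with consecutive index positions.

    A query position holding several terms is satisfied by any one of them.
    This is a simplified model of a phrase query, not Lucene's code.
    """
    q_positions = sorted(query)
    base = q_positions[0]
    for start in sorted(index):
        if all(query[p] & index.get(start + (p - base), set()) for p in q_positions):
            return True
    return False


# Index side for "SmartTV for me" with catenateWords=0 and catenateWords=1.
index_no_catenate = {1: {"smarttv", "smart"}, 2: {"tv"}, 3: {"for"}, 4: {"me"}}
index_catenate = {1: {"smarttv", "smart"}, 2: {"tv", "smarttv"}, 3: {"for"}, 4: {"me"}}

# Query side for "smarttv for me".
query = {1: {"smarttv"}, 2: {"for"}, 3: {"me"}}

print(phrase_matches(index_no_catenate, query))  # False
print(phrase_matches(index_catenate, query))     # True
```

With catenateWords=0 the only 'smarttv' sits at position 1 and is followed
by 'tv', so the phrase cannot line up. With catenateWords=1 the extra
'smarttv' at position 2 is directly followed by 'for' and 'me'.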

====================
Problem we couldn't solve
====================

As we saw above, catenateWords merges the maximal run of compound term
parts into one term and aligns the resulting concatenated term with the
last term part. An illustration with artificial camel casing:

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
spiltOnCaseChange=1, generateNumberParts=0, catenateWords=1,
luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
catenateNumbers=0}

position   term text                startOffset   endOffset   type
1          PriceWaterHouseCoopers   0             22          <ALPHANUM>
1          Price                    0             5           <ALPHANUM>
2          Water                    5             10          <ALPHANUM>
3          House                    10            15          <ALPHANUM>
4          Coopers                  15            22          <ALPHANUM>
4          PriceWaterHouseCoopers   0             22          <ALPHANUM>

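
The positions in this excerpt can be reproduced with a small Python model.
This is a sketch under assumptions, not the filter's actual code: it only
splits on lower-to-upper case changes, it hard-codes preserveOriginal=1 and
generateWordParts=1, and it places the catenated term on the last part's
position because that is what the analysis page shows above. The function
names are invented for this illustration.

```python
import re
from typing import List, Tuple


def split_on_case_change(word: str) -> List[str]:
    # Break at every lower-to-upper transition, e.g. "SmartTV" -> ["Smart", "TV"].
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", word).split()


def word_delimiter_model(word: str, catenate_words: bool, start: int = 1) -> List[Tuple[int, str]]:
    """Return (position, term) pairs the way the excerpts above show them."""
    parts = split_on_case_change(word)
    if len(parts) == 1:
        return [(start, word)]  # nothing to split, single token
    tokens = [(start, word)]  # preserveOriginal=1: original at the first position
    for offset, part in enumerate(parts):
        tokens.append((start + offset, part))  # generateWordParts=1
    if catenate_words:
        # Assumption taken from the excerpt: the catenated term is stacked
        # on the position of the last part.
        tokens.append((start + len(parts) - 1, "".join(parts)))
    return tokens


for position, term in word_delimiter_model("PriceWaterHouseCoopers", catenate_words=True):
    print(position, term)
```

The original and the catenated term end up three positions apart, so
whichever one a lowercase query term hits, its neighbours are shifted
relative to the other.
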
The following text and query will not match each other (note the lowercase
'w' in the query): text='product for PriceWaterHouseCoopers company',
query="product for PricewaterHouseCoopers company":

index:

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
spiltOnCaseChange=1, generateNumberParts=0, catenateWords=1,
luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
catenateNumbers=0}

position   term text                startOffset   endOffset   type
1          product                  0             7           <ALPHANUM>
2          for                      8             11          <ALPHANUM>
3          PriceWaterHouseCoopers   12            34          <ALPHANUM>
3          Price                    12            17          <ALPHANUM>
4          Water                    17            22          <ALPHANUM>
5          House                    22            27          <ALPHANUM>
6          Coopers                  27            34          <ALPHANUM>
6          PriceWaterHouseCoopers   12            34          <ALPHANUM>
7          company                  35            42          <ALPHANUM>

query:

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
spiltOnCaseChange=1, generateNumberParts=0, catenateWords=0,
luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
catenateNumbers=0}

position   term text                startOffset   endOffset   type
1          product                  1             8           <ALPHANUM>
2          for                      9             12          <ALPHANUM>
3          PricewaterHouseCoopers   13            35          <ALPHANUM>
3          Pricewater               13            23          <ALPHANUM>
4          House                    23            28          <ALPHANUM>
5          Coopers                  28            35          <ALPHANUM>
6          company                  36            43          <ALPHANUM>
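
To pinpoint where the two token streams diverge, here is a small Python
sketch that anchors the query phrase at a given index position and reports
the first query position that cannot be satisfied. It is a simplified model
rather than Lucene's matching code; the helper name first_failure is made
up, and the position maps are copied by hand from the two excerpts above,
lowercased.

```python
from typing import Dict, Optional, Set, Tuple

# position -> set of (lowercased) terms stacked at that position
Stream = Dict[int, Set[str]]


def first_failure(index: Stream, query: Stream, start: int) -> Optional[Tuple[int, Set[str], Set[str]]]:
    """Anchor the phrase at index position `start`.

    Returns (query position, terms wanted, terms found in the index) for the
    first query position that has no matching term, or None if all line up.
    """
    base = min(query)
    for p in sorted(query):
        found = index.get(start + (p - base), set())
        if not (query[p] & found):
            return p, query[p], found
    return None


index = {
    1: {"product"},
    2: {"for"},
    3: {"pricewaterhousecoopers", "price"},
    4: {"water"},
    5: {"house"},
    6: {"coopers", "pricewaterhousecoopers"},
    7: {"company"},
}
query = {
    1: {"product"},
    2: {"for"},
    3: {"pricewaterhousecoopers", "pricewater"},
    4: {"house"},
    5: {"coopers"},
    6: {"company"},
}

print(first_failure(index, query, start=1))
```

Anchored at position 1, the first three query positions line up, but query
position 4 wants 'house' where the index holds 'water'. No other anchor
works either, since 'product' occurs only at index position 1.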

Is there any way to make them match?

Thanks for reading this far.

-dmitry
