lucene-solr-user mailing list archives

From Dmitry Kan <dmitry....@gmail.com>
Subject Re: searching camel cased terms with phrase queries
Date Wed, 07 Nov 2012 19:29:45 GMT
Hi Jack,

It seems all the HTML tables in my original e-mail came through broken;
sorry about that.

Thanks for the ideas. You're quite right that compound-word recognition
would be the way to go. Most probably this feature is domain dependent,
so a generic implementation might not satisfy every requirement, which
may be why it is not part of off-the-shelf Solr.
Anyway, since the requirement to split on case change was marginal for our
use case (basically only one customer has asked for it), we decided to
drop the feature altogether.
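
To make the domain-dependence point concrete, here is a minimal sketch of dictionary-based compound splitting in Python. It is only an illustration, not something Solr ships: the function name `split_compound` and the vocabulary `VOCAB` are made up for this example, and a real vocabulary would have to come from the domain's own data.

```python
def split_compound(term, vocabulary):
    """Split `term` into words from `vocabulary`; return None if it cannot be split."""
    term = term.lower()
    # best[i] holds one segmentation of term[:i], or None if none exists.
    best = [None] * (len(term) + 1)
    best[0] = []
    for end in range(1, len(term) + 1):
        for start in range(end):
            piece = term[start:end]
            if best[start] is not None and piece in vocabulary:
                best[end] = best[start] + [piece]
                break
    return best[-1]


# Hypothetical domain vocabulary, assumed for illustration only.
VOCAB = {"smart", "tv", "price", "water", "house", "coopers"}

print(split_compound("smarttv", VOCAB))                 # ['smart', 'tv']
print(split_compound("PricewaterHouseCoopers", VOCAB))  # ['price', 'water', 'house', 'coopers']
print(split_compound("solr", VOCAB))                    # None
```

Unlike splitting on case change, this does not depend on how the user cased the term, but its quality depends entirely on the vocabulary, which is the domain-dependent part.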

P.S. The example with PricewaterhouseCoopers had artificial casing only to
illustrate what happens when there are more than two compound parts in a
word. I didn't mean to offend the company :-)

Regards,
Dmitry

On Wed, Nov 7, 2012 at 6:14 PM, Jack Krupansky <jack@basetechnology.com> wrote:

> This is one of those areas of Solr where you can refine and make
> improvements, as you have done, but never actually reach 100% satisfaction.
> And, in some cases, as here, you have a choice of settings and no single
> combination covers all cases.
>
> In this case, you really need compound-term recognition - detecting that
> two or more terms have been juxtaposed with no lexical boundary. Google has
> it, and I'm sure some Solr users have implemented it on their own, but it
> isn't in Solr proper, yet.
>
> WDF provides a partial approximation, by generating extra, compound terms
> at index time. That works well when ALL of the terms are written together,
> but not when only a subset are written together without lexical boundaries,
> as in your final example.
>
> So, you COULD go the full Google route with a lot of additional effort, or
> accept that you offer only a reasonable approximation. Your choice.
>
> So, pick the approximation which seems "best" and accept that it doesn't
> handle the other cases.
>
> BTW, the proper name is "PricewaterhouseCoopers".
>
> -- Jack Krupansky
>
> -----Original Message----- From: Dmitry Kan
> Sent: Wednesday, November 07, 2012 1:58 AM
> To: solr-user@lucene.apache.org
> Subject: searching camel cased terms with phrase queries
>
>
> Hello list,
>
> There have apparently been a number of threads about handling camel-cased
> words in the past (
> http://search-lucene.com/?q=camel+case&fc_project=Lucene&fc_project=Solr
> ).
> Our case is somewhat different from those.
>
> ===================
> Configuration & example
> ===================
>
> To illustrate the issue, let me give you a real example from our data.
> Suppose there is a term in the original text: SmartTV.
>
> If a user types "SmartTV" or "smart tv", we want both queries to hit the
> original term SmartTV. To achieve this, the following filter is used in
> our Solr 3.4 schema:
>
> index side:
>
>              <filter class="solr.WordDelimiterFilterFactory"
>                generateWordParts="1"
>                generateNumberParts="0"
>                catenateWords="0"
>                catenateNumbers="0"
>                catenateAll="0"
>                preserveOriginal="1"
>                splitOnCaseChange="1"
>              />
>
> query side:
>
>              <filter class="solr.WordDelimiterFilterFactory"
>                generateWordParts="1"
>                generateNumberParts="0"
>                catenateWords="0"
>                catenateNumbers="0"
>                catenateAll="0"
>                preserveOriginal="1"
>                splitOnCaseChange="1"
>              />
>
> (no differences)
>
> Copying from the analysis page, the index will contain the following terms
> and their positions:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> splitOnCaseChange=1, generateNumberParts=0, catenateWords=0,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text  startOffset  endOffset  type
> 1         SmartTV    0            7          <ALPHANUM>
> 1         Smart      0            5          <ALPHANUM>
> 2         TV         5            7          <ALPHANUM>
>
>
> (The StandardTokenizerFactory tokenizer and StandardFilterFactory precede
> this filter, but as they have no effect in this case, their output is
> omitted.)
>
> On the query side the query="smart tv" gets processed like:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> splitOnCaseChange=1, generateNumberParts=0, catenateWords=0,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text  startOffset  endOffset  type
> 1         smart      0            5          <ALPHANUM>
> 2         tv         6            8          <ALPHANUM>
>
> So there is a match (the LowerCaseFilterFactory is of course configured to
> follow the WordDelimiterFilterFactory to unify case for matching), and the
> user can happily issue the queries 'smart tv', 'smarttv' and 'SmartTV'.
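
The token layout above can be reproduced with a small Python model. This is only a sketch that mimics the positions shown in the analysis output in this post; it is not Lucene's WordDelimiterFilter implementation, and the name `wdf_tokens` is invented for illustration. It assumes whitespace tokenization, splits only on a lowercase-to-uppercase transition, keeps the original term (preserveOriginal=1), does no catenation (catenateWords=0), and lowercases every term as LowerCaseFilterFactory would.

```python
import re


def wdf_tokens(text):
    """Return (position, term) pairs for whitespace-separated `text`."""
    tokens = []
    position = 0
    for word in text.split():
        # Mark each lowercase -> uppercase boundary, then split on the marks.
        parts = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", word).split()
        first = position + 1
        # preserveOriginal=1: the unsplit term sits at the first position.
        tokens.append((first, word.lower()))
        if len(parts) > 1:
            # generateWordParts=1: each part gets its own position.
            for offset, part in enumerate(parts):
                tokens.append((first + offset, part.lower()))
        position += len(parts)
    return tokens


print(wdf_tokens("SmartTV"))   # [(1, 'smarttv'), (1, 'smart'), (2, 'tv')]
print(wdf_tokens("smart tv"))  # [(1, 'smart'), (2, 'tv')]
print(wdf_tokens("smarttv"))   # [(1, 'smarttv')]
```

All three query forms share at least one aligned term sequence with the indexed "SmartTV", which is why each of them matches.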
>
> ===================================================
> More complex example that doesn't work with the above configuration
> ===================================================
>
> Problems start to occur if a user types "smarttv for me" against the text
> "SmartTV for me". Here are the index and query analysis excerpts:
>
> index:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> splitOnCaseChange=1, generateNumberParts=0, catenateWords=0,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text  startOffset  endOffset  type
> 1         SmartTV    0            7          <ALPHANUM>
> 1         Smart      0            5          <ALPHANUM>
> 2         TV         5            7          <ALPHANUM>
> 3         for        8            11         <ALPHANUM>
> 4         me         12           14         <ALPHANUM>
>
> query:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> splitOnCaseChange=1, generateNumberParts=0, catenateWords=0,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text  startOffset  endOffset  type
> 1         smarttv    0            7          <ALPHANUM>
> 2         for        8            11         <ALPHANUM>
> 3         me         12           14         <ALPHANUM>
>
> Since the user wrote smarttv in lowercase, no split on case change is
> triggered, and we believe there is no match because the term positions
> differ: 'for' is at position 3 in the index but at position 2 in the
> query, so 'smarttv' and 'for' are not adjacent in the index, as the
> phrase query requires.
>
>
> =========================
> Config change to fix the problem
> =========================
>
>
> Here catenateWords=1 on the indexing side comes to the rescue, which
> changes things to:
>
> index:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> splitOnCaseChange=1, generateNumberParts=0, catenateWords=1,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text  startOffset  endOffset  type
> 1         SmartTV    0            7          <ALPHANUM>
> 1         Smart      0            5          <ALPHANUM>
> 2         TV         5            7          <ALPHANUM>
> 2         SmartTV    0            7          <ALPHANUM>
> 3         for        8            11         <ALPHANUM>
> 4         me         12           14         <ALPHANUM>
>
> query (copying again for comparison purposes):
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> splitOnCaseChange=1, generateNumberParts=0, catenateWords=0,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text  startOffset  endOffset  type
> 1         smarttv    0            7          <ALPHANUM>
> 2         for        8            11         <ALPHANUM>
> 3         me         12           14         <ALPHANUM>
>
>
> Now there should be a match, because the terms 'smarttv', 'for' and 'me'
> are adjacent in the index (ignoring the case differences, as
> LowerCaseFilterFactory unifies them for us), and that is what the phrase
> query "smarttv for me" requires.
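
The position argument can be checked with a simplified phrase matcher in Python. This is a sketch under stated assumptions, not Lucene's phrase scoring: `phrase_matches` is a made-up helper, slop is taken to be 0, and the token lists are copied by hand (lowercased) from the analysis output quoted in this post.

```python
def phrase_matches(index_tokens, query_tokens):
    """True if the query lines up with the index as an exact phrase.

    Tokens are (position, term) pairs; several terms may share a position.
    """
    doc = {}    # term -> set of positions in the indexed text
    for pos, term in index_tokens:
        doc.setdefault(term, set()).add(pos)
    query = {}  # query position -> set of alternative terms
    for pos, term in query_tokens:
        query.setdefault(pos, set()).add(term)
    if not query:
        return False
    first = min(query)
    # Every index position of a first-slot query term is a candidate start.
    starts = set()
    for term in query[first]:
        starts |= doc.get(term, set())
    for start in starts:
        if all(
            any(start + (qpos - first) in doc.get(term, set()) for term in terms)
            for qpos, terms in query.items()
        ):
            return True
    return False


# Index side for "SmartTV for me" with catenateWords=0 ...
index_plain = [(1, "smarttv"), (1, "smart"), (2, "tv"), (3, "for"), (4, "me")]
# ... and with catenateWords=1, which adds 'smarttv' at position 2.
index_catenated = index_plain + [(2, "smarttv")]
# Query side for "smarttv for me".
query = [(1, "smarttv"), (2, "for"), (3, "me")]

print(phrase_matches(index_plain, query))      # False
print(phrase_matches(index_catenated, query))  # True
```

With catenateWords=1 the extra 'smarttv' at position 2 is immediately followed by 'for' at 3 and 'me' at 4, which is the alignment the phrase needs.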
>
> ====================
> Problem we couldn't solve
> ====================
>
> As we saw above, catenateWords merges the maximal run of compound-term
> parts into one term and aligns the resulting concatenated term with the
> last part. An illustration with artificial camel casing:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> splitOnCaseChange=1, generateNumberParts=0, catenateWords=1,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text               startOffset  endOffset  type
> 1         PriceWaterHouseCoopers  0            22         <ALPHANUM>
> 1         Price                   0            5          <ALPHANUM>
> 2         Water                   5            10         <ALPHANUM>
> 3         House                   10           15         <ALPHANUM>
> 4         Coopers                 15           22         <ALPHANUM>
> 4         PriceWaterHouseCoopers  0            22         <ALPHANUM>
>
> The following text and query will not match each other: text='product for
> PriceWaterHouseCoopers company', query="product for PricewaterHouseCoopers
> company":
>
> index:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> splitOnCaseChange=1, generateNumberParts=0, catenateWords=1,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text               startOffset  endOffset  type
> 1         product                 0            7          <ALPHANUM>
> 2         for                     8            11         <ALPHANUM>
> 3         PriceWaterHouseCoopers  12           34         <ALPHANUM>
> 3         Price                   12           17         <ALPHANUM>
> 4         Water                   17           22         <ALPHANUM>
> 5         House                   22           27         <ALPHANUM>
> 6         Coopers                 27           34         <ALPHANUM>
> 6         PriceWaterHouseCoopers  12           34         <ALPHANUM>
> 7         company                 35           42         <ALPHANUM>
>
> query:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1,
> splitOnCaseChange=1, generateNumberParts=0, catenateWords=0,
> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> position  term text               startOffset  endOffset  type
> 1         product                 1            8          <ALPHANUM>
> 2         for                     9            12         <ALPHANUM>
> 3         PricewaterHouseCoopers  13           35         <ALPHANUM>
> 3         Pricewater              13           23         <ALPHANUM>
> 4         House                   23           28         <ALPHANUM>
> 5         Coopers                 28           35         <ALPHANUM>
> 6         company                 36           43         <ALPHANUM>
>
> Is there any way to make them match?
>
> Thanks for reading this far.
>
> -dmitry
>



-- 
Regards,

Dmitry Kan
