lucene-solr-user mailing list archives

From "Dyer, James" <James.D...@ingrambook.com>
Subject RE: Spell Checking a multi word phrase
Date Mon, 17 Jan 2011 19:19:02 GMT
Camden,

You may also want to know about a new feature in Spell Check's "collate" functionality
that guarantees the collations it returns will produce hits.  It can also return more than
one collation and tell you how many hits each one would produce if re-queried.  This might
do the same thing you're trying to do with shingles, but with more accuracy and less work.

For info, look at "spellcheck.collate", "spellcheck.maxCollations", "spellcheck.maxCollationTries"
and "spellcheck.collateExtendedResults" on the component's wiki page: http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate

This feature is committed to 3.x and 4.x and is available as a patch for 1.4.1 (here:  https://issues.apache.org/jira/browse/SOLR-2010).
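
As a rough sketch only, the four parameters above could be set as defaults on the handler you are querying. The handler name "/spell" is an assumption based on the URL in your message, and the numeric values are arbitrary illustrations rather than recommendations; this also assumes a build that includes the feature:

```xml
<!-- Hypothetical handler definition; name and values are illustrative only -->
<requestHandler name="/spell" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <!-- Ask for whole-query collations instead of only per-term suggestions -->
    <str name="spellcheck.collate">true</str>
    <!-- Return up to this many collations -->
    <str name="spellcheck.maxCollations">3</str>
    <!-- Number of candidate collations to test against the index for hits -->
    <str name="spellcheck.maxCollationTries">10</str>
    <!-- Include the hit count and corrections for each collation -->
    <str name="spellcheck.collateExtendedResults">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

The same parameters can also be passed on the request URL instead of being set as defaults.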

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Camden Daily [mailto:camden@jaunter.com] 
Sent: Monday, January 17, 2011 1:01 PM
To: solr-user@lucene.apache.org
Subject: Spell Checking a multi word phrase

Hello all,

I'm pretty new to Solr, and trying to set up a spell checker that can handle
entire phrases.  My goal would be to have something that could offer a
suggestion of "united states" for a query of "untied stats".

I have a very large index, and I've worked a bit with creating shingles for
the spelling index.  The problem I'm running into now is that the
SpellCheckComponent is always tokenizing the query that I pass to it.

For example, a query like this
http://localhost:8080/solr/spell?q=untied\stats&spellcheck=true&debugQuery=on

The debug information shows me that the parsed query is:
PhraseQuery(text:"untied stats")

But I receive the spelling suggestions for "untied" and "stats" separately.
From what I understand, this is not a case where I would want to collate; I
simply want the entire phrase treated as one token.

I found the following post after much searching that suggests setting up a
custom QueryConverter:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200810.mbox/%3C1224516331.3820.119.camel@localhost.localdomain.tld%3E

Does anyone know if that would be required?  I had hoped to avoid Java code
entirely with Solr (I haven't used Java in a very long time), but if I do
need to set up the 'MultiWordSpellingQueryConvert' class, would anyone be
able to give me some tips of exactly how I would add that functionality to
Solr?

Relevant configs below:

solrconfig.xml:

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spellShingle</str>
      <str name="spellcheckIndexDir">./spellShingle</str>
      <str name="queryAnalyzerFieldType">textSpellShingle</str>
      <str name="buildOnOptimize">true</str>
    </lst>
  </searchComponent>

schema.xml:

    <fieldType name="textSpellShingle" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

(I had thought setting the KeywordTokenizer for the query analyzer would
keep it from being tokenized, but it doesn't seem to make any difference)

-Camden Daily
