lucene-solr-user mailing list archives

From Camden Daily <>
Subject Re: Spell Checking a multi word phrase
Date Mon, 17 Jan 2011 19:41:13 GMT

Thank you, but I'm not sure that will work for my needs.  I'm very
interested in contextual spell checking.  Take for example the author
"stephenie meyer".  "stephenie" is a far less popular spelling than
"stephanie", but in this context it's the correct option.  I feel like
shingles with an untokenized query string would be able to catch this, but
I can't find many examples of people attempting this.
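
To make the idea concrete, here is a rough sketch in plain Python (not Solr code) of what I mean. It builds two-word shingles from a made-up toy corpus, then compares the whole untokenized query against those shingles by edit distance, breaking ties by frequency. The corpus, the function names, and the max_distance cutoff are all assumptions chosen for illustration only.

```python
from collections import Counter


def shingles(tokens, size=2):
    """Return word n-grams (shingles) of the given size, joined by spaces."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def suggest_phrase(query, shingle_counts, max_distance=3):
    """Pick the indexed shingle closest to the whole (untokenized) query.

    Ties on distance are broken in favour of the more frequent shingle.
    Returns None when nothing is within max_distance edits.
    """
    query = query.lower().strip()
    candidates = [
        (edit_distance(query, s), -count, s)
        for s, count in shingle_counts.items()
    ]
    candidates = [c for c in candidates if c[0] <= max_distance]
    return min(candidates)[2] if candidates else None


# Hypothetical toy corpus: "stephanie" is the more common token overall,
# but only "stephenie" ever appears next to "meyer".
docs = [
    "twilight by stephenie meyer",
    "new moon by stephenie meyer",
    "stephanie plum novels",
    "stephanie laurens romance",
    "stephanie barron mysteries",
    "the united states of america",
]
counts = Counter()
for doc in docs:
    counts.update(shingles(doc.split()))

print(suggest_phrase("stephanie meyer", counts))  # -> stephenie meyer
print(suggest_phrase("untied stats", counts))     # -> united states
```

A per-word checker would leave "stephanie" alone because it is a valid and more frequent term; matching the phrase as a single token is what lets the surrounding word decide the correction.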

On Mon, Jan 17, 2011 at 2:19 PM, Dyer, James <> wrote:

> Camden,
> You may also want to be aware that there is a new feature added to Spell
> Check's "collate" functionality that will guarantee the collations will
> return hits.  It also is able to return more than one collation and tell you
> how many hits each one would result in if re-queried.  This might do the
> same thing you're trying to do using shingles, but with more accuracy and
> less work.
> For info, look at "spellcheck.collate", "spellcheck.maxCollations",
> "spellcheck.maxCollationTries" & "spellcheck.collateExtendedResults" on the
> component's wiki page:
> This feature is committed to 3.x and 4.x and is available as a patch for
> 1.4.1 (here:
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311
> -----Original Message-----
> From: Camden Daily []
> Sent: Monday, January 17, 2011 1:01 PM
> To:
> Subject: Spell Checking a multi word phrase
> Hello all,
> I'm pretty new to Solr, and trying to set up a spell checker that can handle
> entire phrases.  My goal would be to have something that could offer a
> suggestion of "united states" for a query of "untied stats".
> I have a very large index, and I've worked a bit with creating shingles for
> the spelling index.  The problem I'm running into now is that the
> SpellCheckComponent is always tokenizing the query that I pass to it.
> For example, a query like this
> http://localhost:8080/solr/spell?q=untied\stats&spellcheck=true&debugQuery=on
> The debug information shows me that the parsed query is:
> PhraseQuery(text:"untied stats")
> But I receive the spelling suggestions for "untied" and "stats" separately.
> From what I understand, this is not a case where I would want to collate; I
> simply want the entire phrase treated as one token.
> I found the following post after much searching that suggests setting up a
> custom QueryConverter:
> Does anyone know if that would be required?  I had hoped to avoid Java code
> entirely with Solr (I haven't used Java in a very long time), but if I do
> need to set up the 'MultiWordSpellingQueryConvert' class, would anyone be
> able to give me some tips on exactly how I would add that functionality to
> Solr?
> Relevant configs below:
> solrconfig.xml:
>  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
>    <lst name="spellchecker">
>      <str name="name">default</str>
>      <str name="field">spellShingle</str>
>      <str name="spellcheckIndexDir">./spellShingle</str>
>      <str name="queryAnalyzerFieldType">textSpellShingle</str>
>      <str name="buildOnOptimize">true</str>
>    </lst>
>  </searchComponent>
> schema.xml:
>    <fieldType name="textSpellShingle" class="solr.TextField" positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.KeywordTokenizerFactory"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>    </fieldType>
> (I had thought setting the KeywordTokenizer for the query analyzer would
> keep it from being tokenized, but it doesn't seem to make any difference)
> -Camden Daily
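
For reference, the collation parameters James lists in his message could be combined into a single request roughly as follows. This is only a sketch: the host and handler path are taken from the example URL in the original question, and the numeric values for maxCollations and maxCollationTries are illustrative assumptions, not recommendations.

```python
from urllib.parse import urlencode

# Handler URL as it appears in the original question.
base_url = "http://localhost:8080/solr/spell"

params = {
    "q": "untied stats",
    "spellcheck": "true",
    "spellcheck.collate": "true",
    "spellcheck.maxCollations": 3,        # illustrative value (assumption)
    "spellcheck.maxCollationTries": 10,   # illustrative value (assumption)
    "spellcheck.collateExtendedResults": "true",
}

# urlencode handles escaping, e.g. the space in the query becomes "+".
url = f"{base_url}?{urlencode(params)}"
print(url)
```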
