lucene-solr-user mailing list archives

From "Dyer, James" <James.D...@ingrambook.com>
Subject RE: Spell Checking a multi word phrase
Date Mon, 17 Jan 2011 20:54:51 GMT
Camden,

Have you seen Smiley & Pugh's Solr book?  They describe something very similar to what you're
trying to do on p. 180ff.  The difference seems to be that they use a field with only a couple
of terms, so they don't bother with shingles.  The book makes a big point of using "spellcheck.q"
in this case in order to get the analysis right.  I'm not sure if this is the solution, but
I thought I'd mention it.  I never tried spell checking this way because it seemed very limited
and possibly quite expensive.
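
For reference, a minimal and untested sketch of what a setup using "spellcheck.q"
might look like.  The "/spell" handler name and the host come from your example URL;
the handler defaults and the "default" dictionary name are assumptions for
illustration, not something taken from the book.

```xml
<!-- Sketch only: a search handler with the spellcheck component attached.
     The defaults below are assumptions for illustration. -->
<requestHandler name="/spell" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">default</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

<!-- Hypothetical request: the raw phrase goes in spellcheck.q, separate from q.
     http://localhost:8080/solr/spell?q=untied+stats&spellcheck.q=untied+stats -->
```

As I understand it, the text in "spellcheck.q" is run through the spellchecker's
queryAnalyzerFieldType rather than through the query converter, which is why it
matters for getting the analysis right.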

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Camden Daily [mailto:camden@jaunter.com] 
Sent: Monday, January 17, 2011 1:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Spell Checking a multi word phrase

James,

Thank you, but I'm not sure that will work for my needs.  I'm very
interested in contextual spell checking.  Take for example the author
"stephenie meyer".  "stephenie" is a far less popular spelling than
"stephanie", but in this context it's the correct option.  I feel like
shingles with an untokenized query string would be able to catch this, but
I can't find too many examples of people attempting this.
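
For what it's worth, my understanding is that a custom converter is registered in
solrconfig.xml with a queryConverter element, roughly like the sketch below.  The
package and class name are hypothetical (no such class exists in my setup yet), and
the jar containing it would need to be on Solr's classpath.

```xml
<!-- Sketch only: replaces the stock solr.SpellingQueryConverter.
     com.example.MultiWordSpellingQueryConverter is a hypothetical class name. -->
<queryConverter name="queryConverter"
                class="com.example.MultiWordSpellingQueryConverter"/>
```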

On Mon, Jan 17, 2011 at 2:19 PM, Dyer, James <James.Dyer@ingrambook.com> wrote:

> Camden,
>
> You may also want to be aware that there is a new feature added to Spell
> Check's "collate" functionality that will guarantee the collations will
> return hits.  It also is able to return more than one collation and tell you
> how many hits each one would result in if re-queried.  This might do the
> same thing you're trying to do using shingles, but with more accuracy and
> less work.
>
> For info, look at "spellcheck.collate", "spellcheck.maxCollations",
> "spellcheck.maxCollationTries" & "spellcheck.collateExtendedResults" on the
> component's wiki page:
> http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate
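>
> A minimal sketch of these parameters as request handler defaults.  The
> handler name and all of the values are assumptions for illustration; see
> the wiki page for the actual defaults and semantics.
>
> ```xml
> <requestHandler name="/spell" class="solr.SearchHandler">
>   <lst name="defaults">
>     <str name="spellcheck">true</str>
>     <str name="spellcheck.collate">true</str>
>     <!-- illustrative values, not recommendations -->
>     <str name="spellcheck.maxCollations">3</str>
>     <str name="spellcheck.maxCollationTries">10</str>
>     <str name="spellcheck.collateExtendedResults">true</str>
>   </lst>
>   <arr name="last-components">
>     <str>spellcheck</str>
>   </arr>
> </requestHandler>
> ```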
>
> This feature is committed to 3.x and 4.x and is available as a patch for
> 1.4.1 (here:  https://issues.apache.org/jira/browse/SOLR-2010).
>
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311
>
>
> -----Original Message-----
> From: Camden Daily [mailto:camden@jaunter.com]
> Sent: Monday, January 17, 2011 1:01 PM
> To: solr-user@lucene.apache.org
> Subject: Spell Checking a multi word phrase
>
> Hello all,
>
> I'm pretty new to Solr, and trying to set up a spell checker that can
> handle entire phrases.  My goal would be to have something that could offer
> a suggestion of "united states" for a query of "untied stats".
>
> I have a very large index, and I've worked a bit with creating shingles for
> the spelling index.  The problem I'm running into now is that the
> SpellCheckComponent is always tokenizing the query that I pass to it.
>
> For example, a query like this
>
> http://localhost:8080/solr/spell?q=untied\stats&spellcheck=true&debugQuery=on
>
> The debug information shows me that the parsed query is:
> PhraseQuery(text:"untied stats")
>
> But I receive the spelling suggestions for "untied" and "stats" separately.
> From what I understand, this is not a case where I would want to collate; I
> simply want the entire phrase treated as one token.
>
> I found the following post after much searching that suggests setting up a
> custom QueryConverter:
>
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200810.mbox/%3C1224516331.3820.119.camel@localhost.localdomain.tld%3E
>
> Does anyone know if that would be required?  I had hoped to avoid Java code
> entirely with Solr (I haven't used Java in a very long time), but if I do
> need to set up the 'MultiWordSpellingQueryConverter' class, would anyone be
> able to give me some tips of exactly how I would add that functionality to
> Solr?
>
> Relevant configs below:
>
> solrconfig.xml:
>
>  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
>    <lst name="spellchecker">
>      <str name="name">default</str>
>      <str name="field">spellShingle</str>
>      <str name="spellcheckIndexDir">./spellShingle</str>
>      <str name="queryAnalyzerFieldType">textSpellShingle</str>
>      <str name="buildOnOptimize">true</str>
>    </lst>
>  </searchComponent>
>
> schema.xml:
>
>    <fieldType name="textSpellShingle" class="solr.TextField" positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.KeywordTokenizerFactory"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>    </fieldType>
>
> (I had thought setting the KeywordTokenizer for the query analyzer would
> keep it from being tokenized, but it doesn't seem to make any difference)
>
> -Camden Daily
>
