lucene-solr-user mailing list archives

From Camden Daily <cam...@jaunter.com>
Subject Re: Spell Checking a multi word phrase
Date Tue, 18 Jan 2011 00:59:37 GMT
James,

Thanks, the spellcheck.q was exactly what I needed to be using!
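
For anyone finding this in the archive, here is a rough sketch of what such a request might look like. The host, port and /spell handler are taken from the original message further down; the helper name build_spellcheck_url is made up for illustration. As far as I understand it, the value of spellcheck.q is run through the spellchecker's query analyzer (queryAnalyzerFieldType) instead of the QueryConverter, so with a KeywordTokenizer it should reach the spell checker as a single token. Treat that as an assumption to verify against your Solr version.

```python
from urllib.parse import urlencode


def build_spellcheck_url(base_url, phrase):
    """Build a Solr request URL that sends the raw phrase via spellcheck.q.

    build_spellcheck_url is a hypothetical helper, not part of Solr.
    """
    params = {
        "q": f'"{phrase}"',        # main query, sent as a quoted phrase
        "spellcheck": "true",      # turn the SpellCheckComponent on
        "spellcheck.q": phrase,    # raw phrase handed to the spell checker
    }
    # urlencode percent-encodes the quotes and turns spaces into '+'
    return f"{base_url}?{urlencode(params)}"


# Handler URL as used in the original message; adjust for your setup.
url = build_spellcheck_url("http://localhost:8080/solr/spell", "untied stats")
print(url)
```

The main q parameter still drives the search itself; only the spell checker sees the spellcheck.q value.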

-Camden

On Mon, Jan 17, 2011 at 3:54 PM, Dyer, James <James.Dyer@ingrambook.com> wrote:

> Camden,
>
> Have you seen Smiley & Pugh's Solr book?  They describe something very
> similar to what you're trying to do on pp. 180ff.  The difference seems to be
> they use a field that only has a couple of terms so they don't bother with
> shingles.  The book makes a big point about using "spellcheck.q" in this
> case in order to get the analysis right.  I'm not sure if this is the
> solution but I thought I'd mention it.  I never tried spell checking this
> way because it seemed very limited and possibly quite expensive.
>
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311
>
>
> -----Original Message-----
> From: Camden Daily [mailto:camden@jaunter.com]
> Sent: Monday, January 17, 2011 1:41 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Spell Checking a multi word phrase
>
> James,
>
> Thank you, but I'm not sure that will work for my needs.  I'm very
> interested in contextual spell checking.  Take for example the author
> "stephenie meyer".  "stephenie" is a far less popular spelling than
> "stephanie", but in this context it's the correct option.  I feel like
> shingles with an untokenized query string would be able to catch this, but
> I can't find too many examples of people attempting this.
>
> On Mon, Jan 17, 2011 at 2:19 PM, Dyer, James <James.Dyer@ingrambook.com> wrote:
>
> > Camden,
> >
> > You may also want to be aware that there is a new feature added to Spell
> > Check's "collate" functionality that will guarantee the collations will
> > return hits.  It also is able to return more than one collation and tell you
> > how many hits each one would result in if re-queried.  This might do the
> > same thing you're trying to do using shingles, but with more accuracy and
> > less work.
> >
> > For info, look at "spellcheck.collate", "spellcheck.maxCollations",
> > "spellcheck.maxCollationTries" and "spellcheck.collateExtendedResults" on
> > the component's wiki page:
> > http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate
> >
> > This feature is committed to 3.x and 4.x and is available as a patch for
> > 1.4.1 (here:  https://issues.apache.org/jira/browse/SOLR-2010).
> >
> > James Dyer
> > E-Commerce Systems
> > Ingram Content Group
> > (615) 213-4311
> >
> >
> > -----Original Message-----
> > From: Camden Daily [mailto:camden@jaunter.com]
> > Sent: Monday, January 17, 2011 1:01 PM
> > To: solr-user@lucene.apache.org
> > Subject: Spell Checking a multi word phrase
> >
> > Hello all,
> >
> > I'm pretty new to Solr, and trying to set up a spell checker that can handle
> > entire phrases.  My goal would be to have something that could offer a
> > suggestion of "united states" for a query of "untied stats".
> >
> > I have a very large index, and I've worked a bit with creating shingles for
> > the spelling index.  The problem I'm running into now is that the
> > SpellCheckComponent is always tokenizing the query that I pass to it.
> >
> > For example, a query like this:
> >
> > http://localhost:8080/solr/spell?q=untied\stats&spellcheck=true&debugQuery=on
> >
> > The debug information shows me that the parsed query is:
> > PhraseQuery(text:"untied stats")
> >
> > But I receive the spelling suggestions for "untied" and "stats" separately.
> > From what I understand, this is not a case where I would want to collate; I
> > simply want the entire phrase treated as one token.
> >
> > I found the following post after much searching that suggests setting up a
> > custom QueryConverter:
> >
> > http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200810.mbox/%3C1224516331.3820.119.camel@localhost.localdomain.tld%3E
> >
> > Does anyone know if that would be required?  I had hoped to avoid Java code
> > entirely with Solr (I haven't used Java in a very long time), but if I do
> > need to set up the 'MultiWordSpellingQueryConvert' class, would anyone be
> > able to give me some tips on exactly how I would add that functionality to
> > Solr?
> >
> > Relevant configs below:
> >
> > solrconfig.xml:
> >
> >  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
> >    <lst name="spellchecker">
> >      <str name="name">default</str>
> >      <str name="field">spellShingle</str>
> >      <str name="spellcheckIndexDir">./spellShingle</str>
> >      <str name="queryAnalyzerFieldType">textSpellShingle</str>
> >      <str name="buildOnOptimize">true</str>
> >    </lst>
> > </searchComponent>
> >
> > schema.xml:
> >
> >    <fieldType name="textSpellShingle" class="solr.TextField"
> > positionIncrementGap="100">
> >      <analyzer type="index">
> >        <tokenizer class="solr.StandardTokenizerFactory"/>
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >        <filter class="solr.StopFilterFactory" ignoreCase="true"
> > words="stopwords.txt"/>
> >        <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
> > outputUnigrams="true"/>
> >        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >      </analyzer>
> >      <analyzer type="query">
> >        <tokenizer class="solr.KeywordTokenizerFactory"/>
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >      </analyzer>
> >    </fieldType>
> >
> > (I had thought setting the KeywordTokenizer for the query analyzer would
> > keep it from being tokenized, but it doesn't seem to make any difference.)
> >
> > -Camden Daily
> >
>
