lucene-solr-user mailing list archives

From skmirch <skmi...@hotmail.com>
Subject Solr Multiword Search
Date Mon, 01 Apr 2013 23:50:10 GMT
We have a catalog of media content that is ingested into Solr. We are
trying to spell-check the title of a catalog item so that the client can
correctly predict and correct (mis)typed text. The requirement is that the
corrected text match a title in the catalog.

I have been experimenting with the spellcheck component and its request
handler on Solr 4.2.

solrconfig.xml
--------------
    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">

       <str name="queryAnalyzerFieldType">text_spell</str>

     <lst name="spellchecker">
       <str name="name">default</str>
       <str name="field">mySpell</str>
       <str name="classname">solr.DirectSolrSpellChecker</str>
       <str name="distanceMeasure">internal</str>
       <float name="accuracy">0.5</float>
       <int name="maxEdits">2</int>
       <int name="minPrefix">1</int>
       <int name="maxInspections">5</int>
       <int name="minQueryLength">4</int>
       <float name="maxQueryFrequency">0.01</float>
       
     </lst>
    </searchComponent>

  <queryConverter name="queryConverter"
class="com.foo.MultiWordSpellingQueryConverter"/>

  <requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <str name="df">mySpell</str>
      
      
      <str name="spellcheck.dictionary">default</str>
      <str name="spellcheck">on</str>
      <str name="spellcheck.extendedResults">true</str>
      <str name="spellcheck.count">10</str>
      <str name="spellcheck.alternativeTermCount">5</str>
      <str name="spellcheck.maxResultsForSuggest">5</str>
      <str name="spellcheck.collate">true</str>
      <str name="spellcheck.collateExtendedResults">true</str>
      <str name="spellcheck.maxCollationTries">10</str>
      <str name="spellcheck.maxCollations">10</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>

schema.xml
------------
    <types>
                <fieldType name="text_spell" class="solr.TextField"
sortMissingLast="true" omitNorms="true" omitTermFreqAndPositions="true">
                        <analyzer>
                                <tokenizer
class="solr.KeywordTokenizerFactory" />
                                <filter
class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="1" preserveOriginal="0" />
                                <filter class="solr.LowerCaseFilterFactory"
/>
                                <filter
class="solr.RemoveDuplicatesTokenFilterFactory" />
                                
                        </analyzer>
                </fieldType>
   </types>

<fields>
   <field name="mySpell" type="text_spell" indexed="true" stored="true"
multiValued="true" />
</fields>
   <copyField source="title" dest="mySpell" />

Notice that I am using a custom QueryConverter, with definitions as follows:

/* MultiWordSpellingQueryConverter.java */
package com.foo;

import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

import org.apache.log4j.Logger;
import org.apache.lucene.analysis.Token;
import org.apache.solr.spelling.QueryConverter;

public class MultiWordSpellingQueryConverter extends QueryConverter {
    private static final Logger log =
            Logger.getLogger(MultiWordSpellingQueryConverter.class);

    // Debug output so I can tell whether Solr loads this class at all.
    static {
        System.out.println("********* Loading class MultiWordSpellingQueryConverter");
        log.fatal("********* Loading class MultiWordSpellingQueryConverter");
    }

    /**
     * Converts the original query string to a collection of Lucene Tokens.
     * The whole string is returned as one token, so the spellchecker
     * receives the full phrase instead of individual words.
     *
     * @param original the original query string
     * @return a Collection of Lucene Tokens
     */
    @Override
    public Collection<Token> convert(String original) {
        if (original == null) {
            return Collections.emptyList();
        }
        // Debug output so I can tell whether convert() is invoked.
        System.out.println("Original String : " + original);
        log.error("Original String : " + original);
        // One token covering the entire input, offsets 0..length.
        final Token token = new Token(original.toCharArray(), 0,
                original.length(), 0, original.length());
        return Arrays.asList(token);
    }
}

I followed the directions in another thread:
http://lucene.472066.n3.nabble.com/Full-sentence-spellcheck-tt3265257.html#a3281189
because I feel this is what I really want.

I have tried two ways of deploying the compiled class: placing the jar in
the ${solr.home}/lib directory, and unpacking solr.war, adding the jar to
WEB-INF/lib, repacking the war and placing it in the web server's deploy
directory. I cannot tell whether this class is even being invoked at
spellcheck time. The queryConverter tag is defined in solrconfig.xml
(refer to the solrconfig.xml definitions above).

Query:
http://localhost/solr/spell?q=((title:("charles%20and%20the%20chocolate%20factory")))&spellcheck.q=charles%20and%20the%20chocolat%20factory&spellcheck=true&spellcheck.collate=true

Of course, I have spelt "Charlie" incorrectly as "charles". The catalog
does in fact contain a title named "Charlie and the chocolate factory", but
the above query neither finds it nor collates well enough to correct the
spelling. I believe the edit distance is about 2 ("charles" becomes
"charlie" with two substitutions), so I expected a Levenshtein-based
spellchecker to find this quickly and suggest it as the best match.
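
To make the edit-distance arithmetic concrete, here is a minimal sketch of
the classic dynamic-programming Levenshtein distance. The class and method
names are hypothetical, and this is not Solr's internal implementation; it
only checks the numbers I quoted above.

```java
// Minimal sketch: textbook Levenshtein distance with two rolling rows.
// EditDistanceSketch and levenshtein() are illustrative names only;
// they are not part of Solr or Lucene.
final class EditDistanceSketch {

    /** Counts single-character inserts, deletes and substitutions from a to b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Single-word distances for the two misspelt words in spellcheck.q.
        System.out.println(levenshtein("charles", "charlie"));    // 2
        System.out.println(levenshtein("chocolat", "chocolate")); // 1
        // Whole-phrase distance between spellcheck.q and the catalog title.
        System.out.println(levenshtein(
                "charles and the chocolat factory",
                "charlie and the chocolate factory"));            // 3
    }
}
```

Under these assumptions the single word is 2 edits away from "charlie",
while the full spellcheck.q string is 3 edits away from the catalog title,
because "chocolat" is also missing its final "e".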

Suggestions from my script look like the following:
Title|Hits
charles and the chocolate factory|205808|
charles and the chocolate factor|205631|
charles and the chocolates factory|205508|
charley and the chocolate factory|203594|
charles and the chocolata factory|205506|
charles and the chocolate factoria|205544|
charles and the chocolates factor|205330|
charlet and the chocolate factory|203441|
charley and the chocolate factor|203417|
charley and the chocolates factory|203294|

The list above shows the suggested collations and their hit counts, all
extracted from the response XML for the above query.
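
For reference, the extraction step can be sketched roughly as follows. This
is not my actual script. It assumes that, with
spellcheck.collateExtendedResults=true, each collation appears in the
response as a <lst name="collation"> element holding a
<str name="collationQuery"> and an <int name="hits"> child; the sample XML
and the class name are hypothetical and trimmed to those elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Sketch only: pulls "collationQuery|hits" rows out of a spellcheck response.
// The element layout is an assumption about the extended collation format.
final class CollationExtractorSketch {

    static List<String> extract(String responseXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(responseXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList collations = (NodeList) xpath.evaluate(
                    "//lst[@name='collation']", doc, XPathConstants.NODESET);
            List<String> rows = new ArrayList<>();
            for (int i = 0; i < collations.getLength(); i++) {
                Node collation = collations.item(i);
                // Both lookups are relative to the current collation element.
                String query = xpath.evaluate("str[@name='collationQuery']", collation);
                String hits = xpath.evaluate("int[@name='hits']", collation);
                rows.add(query + "|" + hits);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse response XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, heavily trimmed response using two rows from my list.
        String sample =
                "<response><lst name=\"spellcheck\"><lst name=\"suggestions\">"
              + "<lst name=\"collation\">"
              + "<str name=\"collationQuery\">charles and the chocolate factory</str>"
              + "<int name=\"hits\">205808</int></lst>"
              + "<lst name=\"collation\">"
              + "<str name=\"collationQuery\">charley and the chocolate factory</str>"
              + "<int name=\"hits\">203594</int></lst>"
              + "</lst></lst></response>";
        for (String row : extract(sample)) {
            System.out.println(row);
        }
    }
}
```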

What I would expect to see is "Charlie and the Chocolate Factory" at the
very top of the list, since it is in my catalog verbatim. None of the
collations listed above are in the catalog.
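
To pin down the requirement, here is a small sketch of the acceptance test
I have in mind: a suggestion counts only if it equals some catalog title,
ignoring case. The class name, method name and the in-memory title list
are hypothetical; in practice the titles live in the Solr index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Sketch only: keeps the suggestions that are exact catalog titles
// (case-insensitive). Names and data here are illustrative.
final class CatalogFilterSketch {

    static List<String> keepCatalogTitles(List<String> suggestions,
                                          Collection<String> catalogTitles) {
        Set<String> known = new HashSet<>();
        for (String title : catalogTitles) {
            known.add(title.toLowerCase(Locale.ROOT));
        }
        List<String> kept = new ArrayList<>();
        for (String suggestion : suggestions) {
            if (known.contains(suggestion.toLowerCase(Locale.ROOT))) {
                kept.add(suggestion);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> catalog = Arrays.asList("Charlie and the Chocolate Factory");
        List<String> suggestions = Arrays.asList(
                "charles and the chocolate factory",
                "charley and the chocolate factory",
                "charlie and the chocolate factory");
        // Only the last suggestion matches a catalog title.
        System.out.println(keepCatalogTitles(suggestions, catalog));
    }
}
```

By this test, every collation in my list above would be rejected, which is
exactly the problem.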

I am not sure how to achieve my goal of suggesting a corrected phrase that
exists as a title in my catalog. I would appreciate any help on this front.

Thanks in advance.
Regards,
-- Sandeep



--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Multiword-Search-tp4053038.html
Sent from the Solr - User mailing list archive at Nabble.com.
