lucene-solr-user mailing list archives

From Alex Willmer <al.will...@logica.com>
Subject StandardTokenizer and domain names containing digits
Date Thu, 19 Apr 2012 16:04:21 GMT
TL;DR: How can I make Solr treat "ns1.define.logica.com" as a single token, 
the same way it treats "ns.define.logica.com"?

We have just started using Solr 3.5.0 in production and have run into a 
slightly surprising behaviour with the query "ns1.define.logica.com", sent 
through an edismax handler with "q.op"=AND. The handler is defined as follows:

<requestHandler name="search" class="solr.SearchHandler" default="true">
 <lst name="defaults">
   <str name="echoParams">explicit</str>
   <int name="rows">10</int>
   <!-- #define customisations -->
   <str name="defType">edismax</str>
   <str name="q.op">AND</str>
   <str name="qf">
    body^0.5 comments^0.4 tags^1.2 title^2.0 involved^1.5 id^10.0
    author^10.9 changed created oneline^0.7
   </str>
   <str name="pf">
    body^0.2 tags^1.1 title^1.5
   </str>
 </lst>
</requestHandler>

The schema is defined with fields of type text_general, as found in the example 
schema.xml, namely:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The search string is being tokenised to "ns1", "define.logica.com", and the 
resulting query becomes

+DisjunctionMaxQuery((((tags:ns1 tags:define.logica.com)^1.2) | 
id:ns1.define.logica.com^10.0 | ((body:ns1 body:define.logica.com)^0.5) | 
((author:ns1 author:define.logica.com)^10.9) | ((oneline:ns1 
oneline:define.logica.com)^0.7) | ((title:ns1 title:define.logica.com)^2.0) | 
((involved:ns1 involved:define.logica.com)^1.5) | ((comments:ns1 
comments:define.logica.com)^0.4))) DisjunctionMaxQuery((tags:"ns1 
define.logica.com"^1.1 | body:"ns1 define.logica.com"^0.2 | title:"ns1 
define.logica.com"^1.5))

meaning that documents containing "ns1" OR "define.logica.com" are returned. 
This is contrary to e.g. "ns.define.logica.com", which is treated as a single 
token. Is there a way I can make Solr treat both queries the same way?
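
For illustration, here is a minimal, untested sketch of one direction that 
might work: a separate field type that splits only on whitespace, so a host 
name is never broken at a dot. The name "text_hostname" is hypothetical and 
is not part of the example schema.xml. The sketch assumes the affected fields 
could be switched to this type and reindexed.

```xml
<!-- Hypothetical field type; "text_hostname" is an assumed name. -->
<fieldType name="text_hostname" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer with no "type" attribute applies at both index and query time. -->
  <analyzer>
    <!-- Splits on whitespace only, so "ns1.define.logica.com" stays one token. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The trade-off is that adjacent punctuation also stays attached to tokens 
(e.g. a trailing comma after a host name), so this would probably suit a 
dedicated host name field better than free text such as body or comments.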

Many thanks, Alex
-- 
Alex Willmer | Developer
2 Trinity Park,  Birmingham, B37 7ES | United Kingdom 
M: +44 7557 752744
al.willmer@logica.com | www.logica.com
Logica UK Ltd, registered in UK (registered number 947968)
Registered Office: 250 Brook Drive, Green Park, Reading RG2 6UA, United Kingdom


