lucene-solr-user mailing list archives

From "Mike L." <javaone...@yahoo.com.INVALID>
Subject Re: WordDelimiterFilterFactory - tokenizer question
Date Sun, 05 Apr 2015 16:40:38 GMT

Thanks Jack! That was an oversight on my end. I had also assumed that splitOnNumerics="1" and
LowerCaseFilterFactory would break out the tokens. I tried again with generateWordParts="1"
generateNumberParts="1" and it seemed to work. Appreciate it.
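
A minimal sketch of what that change looks like on the index-time filter, assuming only the two generate* attributes differ from the original definition quoted further down. The message does not say whether the query analyzer was changed as well, so applying the same flags there is an assumption:

```xml
<!-- Sketch only: same WordDelimiterFilterFactory line as in the original
     fieldType, with generateWordParts and generateNumberParts switched on. -->
<filter class="solr.WordDelimiterFilterFactory"
        splitOnNumerics="1"
        preserveOriginal="1"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="0"
        catenateNumbers="0"
        catenateAll="0"/>
<!-- With these flags a lowercased token such as us100a should come out as
     us100a (the preserved original) plus the parts us, 100 and a. -->
```

The Analysis screen in the Solr admin UI is a quick way to confirm which tokens are actually produced for a given value. A reindex is needed after changing the index analyzer.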

Mike

From: Jack Krupansky <jack.krupansky@gmail.com>
To: solr-user@lucene.apache.org; Mike L. <javaone123@yahoo.com>
Sent: Sunday, April 5, 2015 8:23 AM
Subject: Re: WordDelimiterFilterFactory - tokenizer question

You have to tell the filter what types of tokens to generate - words, numbers. You told it
to generate... nothing. You did tell it to preserve the original, unfiltered token though,
which is fine.
-- Jack Krupansky


On Sun, Apr 5, 2015 at 3:39 AM, Mike L. <javaone123@yahoo.com.invalid> wrote:

Solr User Group,
    I have a non-multivalued field which contains stored values similar to these:

US100A
US100B
US100C
US100-D
US100BBA

My assumption was that, if I tokenized with the fieldType definition below, the WDF
splitOnNumerics setting and the LowerCaseFilterFactory would give me Solr matches across
these field values for the following queries:

?q=US 100
?q=US100

In other words, US100A, US100B, US100C and US100-D would all have matched and been scored
against my qf weights. However, I'm not seeing that sort of behavior. I have tried various
combinations and am starting to question my assumptions about the tokenizer.

Ideally, I would like to return all values (US100A, US100B, US100C, US100-D) when, for example,
q=US100A is searched on this field.

I know I should probably provide the debugQuery results, but I was hoping this would be a quick
hit for somebody, and I'm also reindexing at the moment. WordDelimiterFilterFactory doesn't seem
to be working as expected. I'm hoping to get some clarification, or to hear if something sticks out here.

Below is the field type definition being used:
<fieldType name="field_tokenized" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1" preserveOriginal="1"
            generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
  </analyzer>

  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1"
            generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
  </analyzer>
</fieldType>


Thanks in advance.
Mike








  