lucene-dev mailing list archives

From Bjørn Hjelle (JIRA) <j...@apache.org>
Subject [jira] [Updated] (SOLR-7926) Hit highlighting with EdgeNGramFilterFactory
Date Fri, 14 Aug 2015 10:53:45 GMT

     [ https://issues.apache.org/jira/browse/SOLR-7926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bjørn Hjelle updated SOLR-7926:
-------------------------------
    Description: 
The highlighter highlights the whole word, not just the part that matches the search term,
when the field type uses EdgeNGramFilterFactory.

In schema.xml I have field type text_ngram:

                <fieldType name="text_ngram" class="solr.TextField">
                        <analyzer type="index">
                                <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
                                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                           <!--tokenizer class="solr.StandardTokenizerFactory"/-->
                                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
                                <filter class="solr.LowerCaseFilterFactory"/>
                                <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20"
minGramSize="3" luceneMatchVersion="4.3"/>
                                <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æ?~F?~E])"
replacement="" replace="all"/>
                        </analyzer>
                        <analyzer type="query">
                                <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
                                <tokenizer class="solr.StandardTokenizerFactory"/>
                                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
                                <filter class="solr.LowerCaseFilterFactory"/>
                                <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æ?~F?~E])"
replacement="" replace="all"/>
                                <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{20})(.*)?"
replacement="$1" replace="all"/>
                        </analyzer>
                </fieldType>

And this dynamic field:

   <dynamicField name="*_n"  type="text_ngram"    indexed="true"  stored="true"/>

On the Solr Admin Analysis screen, with index value "lucene" and query value "luc", it shows this:

LENGTF text             luc         luce            lucen               lucene
       raw_bytes        [6c 75 63]  [6c 75 63 65]   [6c 75 63 65 6e]    [6c 75 63 65 6e 65]
       start            0           0               0                   0
       end              6           6               6                   6   
       positionLength   1           1               1                   1    
       type             word        word            word                word
       position         1           1               1                   1    

Since the end offset is 6 in this case, the whole word ("lucene") is highlighted.
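
To make the effect concrete, here is a minimal Python sketch of the behaviour reported above. It is not Lucene's implementation: the function names are hypothetical, and it simply assumes that every edge n-gram keeps the start and end of the original token, as the analysis output shows, and that the highlighter wraps the stored text between those two values.

```python
def edge_ngrams_keep_offsets(token, token_start, min_gram, max_gram):
    """Hypothetical model: every edge n-gram keeps the offsets of the
    whole original token (the behaviour seen in the analysis output)."""
    token_end = token_start + len(token)
    longest = min(max_gram, len(token))
    return [
        (token[:size], token_start, token_end)
        for size in range(min_gram, longest + 1)
    ]


def highlight(stored, start, end, pre="<em>", post="</em>"):
    """Wrap the stored text between start and end in highlight tags."""
    return stored[:start] + pre + stored[start:end] + post + stored[end:]


# Same values as in the report: minGramSize=3, maxGramSize=20.
tokens = edge_ngrams_keep_offsets("lucene", 0, 3, 20)

# The query term "luc" matches the first indexed gram, but that gram
# still carries the offsets of the whole word.
match = next(t for t in tokens if t[0] == "luc")
highlighted = highlight("lucene", match[1], match[2])
print(tokens)
print(highlighted)
```

Under these assumptions the gram "luc" matches the query, yet the highlighted span covers all of "lucene", because the only offsets available to the highlighter are 0 and 6.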
	
If I switch to NGramFilterFactory, it shows this (for the first three items):

LENGTF text             luc         uce             cen               
       raw_bytes        [6c 75 63]  [75 63 65]      [63 65 6e]    
       start            0           1               2                 
       end              3           4               5                   
       positionLength   1           1               1                    
       type             word        word            word            
       position         1           1               1               

The end position is then correct and the highlighter highlights only the search term. Note
that I have specified luceneMatchVersion="4.3". Without it, the end positions go back
to 6 for NGramFilterFactory as well.
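
For comparison, the following Python sketch models the case where each gram carries its own offsets, as in the NGramFilterFactory output above. Again this is only an illustration under that assumption, with hypothetical function names; it is not the Lucene code, and it generates grams of a single size only.

```python
def ngrams_with_own_offsets(token, token_start, gram_size):
    """Hypothetical model: each n-gram gets offsets that cover only
    its own characters within the original token."""
    return [
        (token[i:i + gram_size], token_start + i, token_start + i + gram_size)
        for i in range(len(token) - gram_size + 1)
    ]


def highlight(stored, start, end, pre="<em>", post="</em>"):
    """Wrap the stored text between start and end in highlight tags."""
    return stored[:start] + pre + stored[start:end] + post + stored[end:]


grams = ngrams_with_own_offsets("lucene", 0, 3)
first_three = grams[:3]

# The gram matching the query term "luc" now ends at 3, not 6.
match = next(g for g in grams if g[0] == "luc")
highlighted = highlight("lucene", match[1], match[2])
print(first_three)
print(highlighted)
```

With per-gram offsets the highlighter can mark just the matched prefix, which is the result I would expect from EdgeNGramFilterFactory as well.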

	



> Hit highlighting with EdgeNGramFilterFactory
> --------------------------------------------
>
>                 Key: SOLR-7926
>                 URL: https://issues.apache.org/jira/browse/SOLR-7926
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>    Affects Versions: 5.1, 5.2.1
>         Environment: CentOS 7 (5.2.1), OS X 10.10.5 (5.1)
>            Reporter: Bjørn Hjelle
>            Priority: Critical
>              Labels: EdgeNGramTokenFilter, highlighting
>


