lucene-dev mailing list archives

From Jan Høydahl (JIRA) <j...@apache.org>
Subject [jira] [Commented] (SOLR-7926) Hit highlighting with EdgeNGramFilterFactory
Date Fri, 14 Aug 2015 14:04:46 GMT

    [ https://issues.apache.org/jira/browse/SOLR-7926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697053#comment-14697053 ]

Jan Høydahl commented on SOLR-7926:
-----------------------------------

Hi. 

This kind of question is better suited to the solr-user mailing list. Most likely this is
not a bug. Please ask the question on the list, and also say which highlighter implementation
you use, with what configuration, and why you expect it to do what you want (refer to the
documentation). I'll close this JIRA as "Invalid".

If it turns out to be a suspected bug, or you find that the result you want is not easily
configurable with any of the existing highlighter implementations, then please re-open.

> Hit highlighting with EdgeNGramFilterFactory
> --------------------------------------------
>
>                 Key: SOLR-7926
>                 URL: https://issues.apache.org/jira/browse/SOLR-7926
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>    Affects Versions: 5.1, 5.2.1
>         Environment: CentOS 7 (5.2.1), OS X 10.10.5 (5.1)
>            Reporter: Bjørn Hjelle
>            Priority: Critical
>              Labels: EdgeNGramTokenFilter, highlighting
>
> Hit highlighting highlights the whole word, not just the part that matches the search term,
> when using EdgeNGramFilterFactory in the field type.
> In schema.xml I have field type text_ngram:
>   <fieldType name="text_ngram" class="solr.TextField">
>     <analyzer type="index">
>       <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       <!--tokenizer class="solr.StandardTokenizerFactory"/-->
>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="3" luceneMatchVersion="4.3"/>
>       <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æ?~F?~E])" replacement="" replace="all"/>
>     </analyzer>
>     <analyzer type="query">
>       <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æ?~F?~E])" replacement="" replace="all"/>
>       <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
>     </analyzer>
>   </fieldType>
> In the Solr Admin Analysis screen, with index value "lucene" and query value "luc", it shows this:
> LENGTF text             luc         luce            lucen               lucene
>        raw_bytes        [6c 75 63]  [6c 75 63 65]   [6c 75 63 65 6e]    [6c 75 63 65 6e 65]
>        start            0           0               0                   0
>        end              6           6               6                   6
>        positionLength   1           1               1                   1
>        type             word        word            word                word
>        position         1           1               1                   1
> Since the end position is 6 in this case, the whole word ("lucene") is highlighted.
> 	
> If I change to use NGramFilterFactory it shows me this (for the first three items):
> LENGTF text             luc         uce             cen
>        raw_bytes        [6c 75 63]  [75 63 65]      [63 65 6e]
>        start            0           1               2
>        end              3           4               5
>        positionLength   1           1               1
>        type             word        word            word
>        position         1           1               1
> The end position is correct then, and the highlighter highlights only the search term.
> Note that I have specified luceneMatchVersion="4.3". Without this, the end positions go
> back to 6 for the NGramFilterFactory as well.
> 	
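
The behaviour described in the report can be illustrated with a small toy model. The sketch below is not Lucene or Solr code: the function names are hypothetical, and the start/end values are simply modelled on the two analysis outputs quoted above. It assumes a highlighter that marks up the span between the start and end offsets of the matching indexed token, which is consistent with what the reporter observes.

```python
from typing import List, Tuple

# (term, start_offset, end_offset); a toy stand-in for an analyzed token.
Token = Tuple[str, int, int]


def edge_ngrams_keep_offsets(word: str, start: int,
                             min_gram: int, max_gram: int) -> List[Token]:
    """Front-edge grams that all keep the offsets of the original word.

    Mirrors the first analysis table (start=0, end=6 for every gram).
    """
    end = start + len(word)
    longest = min(max_gram, len(word))
    return [(word[:n], start, end) for n in range(min_gram, longest + 1)]


def ngrams_own_offsets(word: str, start: int,
                       min_gram: int, max_gram: int) -> List[Token]:
    """All grams, each carrying the offsets of the characters it covers.

    Mirrors the second analysis table (luc 0-3, uce 1-4, cen 2-5).
    """
    tokens: List[Token] = []
    for n in range(min_gram, max_gram + 1):      # gram size
        for i in range(0, len(word) - n + 1):    # gram start within the word
            tokens.append((word[i:i + n], start + i, start + i + n))
    return tokens


def highlight(text: str, tokens: List[Token], query: str,
              pre: str = "<em>", post: str = "</em>") -> str:
    """Wrap the offset span of the first token whose term equals the query."""
    for term, s, e in tokens:
        if term == query:
            return text[:s] + pre + text[s:e] + post + text[e:]
    return text


if __name__ == "__main__":
    word = "lucene"
    edge = edge_ngrams_keep_offsets(word, 0, 3, 20)
    grams = ngrams_own_offsets(word, 0, 3, 20)
    print(highlight(word, edge, "luc"))   # whole word is wrapped
    print(highlight(word, grams, "luc"))  # only the matching prefix is wrapped
```

In this model the highlighted span depends only on the offsets stored with the matching token, so grams that inherit the offsets of the whole word cause the whole word to be marked up.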



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

