lucene-solr-user mailing list archives

From AHMET ARSLAN <iori...@yahoo.com>
Subject Re: problems with PhraseHighlighter
Date Sun, 01 Nov 2009 17:06:50 GMT
> Copy-paste your field definition for
> the field you are trying to
> highlight/search on.
> 
> Cheers
> Avlesh

Thank you for your interest Avlesh,

My field type mostly contains custom filters and tokenizers.

<fieldType name="XMLText" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="XMLStripStandardTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index.txt" ignoreCase="true" expand="true" />
    <filter class="CustomStemFilterFactory" protected="protwords.txt" />
    <filter class="LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="CustomTokenizerFactory" />
    <filter class="CustomDeasciifyFilterFactory" />
    <filter class="CustomStemFilterFactory" protected="protwords.txt" />
    <filter class="LowerCaseFilterFactory" />
  </analyzer>
</fieldType>


First I tried solr.HTMLStripCharFilterFactory to strip the XML tags. It works fine, but
when it comes to highlighting, the <em> tags are placed at the wrong position. The same
happens with solr.HTMLStripStandardTokenizerFactory. Interestingly, the <em> tags are inserted
exactly one character before the actual term. So I added a new token definition to
StandardTokenizer's JFlex file that recognizes XML tags and ignores them. I confirmed with some
test cases that it works. It strips the XML tags at the tokenizer level. I am doing this because
I display the original documents with XML + XSLT, so I need to highlight the XML files themselves.
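
To illustrate the misplaced <em> tags: a highlighter inserts its tags at character offsets into the original, unstripped text, so a token offset that is off by one shifts the tag by one character. The following is only a minimal sketch of that mechanism, not Solr's actual highlighter code; the class name, the method and the sample document are made up for illustration.

```java
public class OffsetHighlightSketch {

    /**
     * Wraps original[start, end) in <em> tags.
     * start and end are character offsets into the ORIGINAL text
     * (XML tags included), as a token's offsets would be.
     */
    static String highlight(String original, int start, int end) {
        StringBuilder out = new StringBuilder();
        out.append(original, 0, start);                      // text before the term
        out.append("<em>").append(original, start, end).append("</em>");
        out.append(original, end, original.length());        // text after the term
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample document; "world" occupies offsets [9, 14).
        String doc = "<p>hello world</p>";

        // Correct offsets: the tag wraps exactly the term.
        System.out.println(highlight(doc, 9, 14));

        // Offsets shifted by one: the tag starts one character too early.
        System.out.println(highlight(doc, 8, 13));
    }
}
```

With the correct offsets the term is wrapped exactly; with offsets shifted by one, the opening tag lands one character before the term, which matches the symptom described above.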

And I am using ComplexPhraseQueryParser [1].

But I reproduced the problem with &defType=lucene&q="term1 term2"~5. I can see that term1
and term2 are within 5 terms of each other, so the document is returned, but the highlighting
is empty. And there are no XML tags (stripped by the tokenizer) between those terms in the
original document.
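
For anyone who wants to reproduce this, the request can be sketched roughly as below. Only defType and q come from my actual test; the base URL, the field name "content" and the exact set of hl.* parameters are assumptions for illustration, so adjust them to your setup.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class HighlightRequestSketch {

    /**
     * Builds a select URL for a proximity phrase query with highlighting enabled.
     * baseUrl and field are placeholders (assumptions), not values from my setup.
     */
    static String buildUrl(String baseUrl, String query, String field) {
        // The query must be URL-encoded: quotes, spaces and '~' are not URL-safe.
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        String fl = URLEncoder.encode(field, StandardCharsets.UTF_8);
        return baseUrl
                + "?defType=lucene"
                + "&q=" + q
                + "&hl=true"                       // turn highlighting on
                + "&hl.fl=" + fl                   // field to highlight
                + "&hl.usePhraseHighlighter=true"; // phrase-aware highlighting
    }

    public static void main(String[] args) {
        // Assumed default Solr location; the field name is hypothetical.
        String url = buildUrl("http://localhost:8983/solr/select",
                "\"term1 term2\"~5", "content");
        System.out.println(url);
    }
}
```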

The hl.maxAnalyzedChars parameter applies to the original document, right? I mean, in my case,
including the XML tags too.

[1] http://lucene.apache.org/java/2_9_0/api/contrib-misc/org/apache/lucene/queryParser/complexPhrase/package-summary.html

