lucene-solr-user mailing list archives

From Michael Sokolov <msoko...@safaribooksonline.com>
Subject Re: PostingHighlighter complains about no offsets
Date Sat, 03 May 2014 18:57:38 GMT
No, not yet; but that could be one more reason to upgrade.  The
performance boost from PostingsHighlighter (PH) is quite nice.  In my
test, it's about 7x faster than the default highlighter, almost 2x
faster than the "fast" vector highlighter, and only about a 50% penalty
compared to no highlighting at all, so this could be a huge win for us.
I haven't looked at the actual highlighting yet.  From what I
understand, the main sacrifice would be phrase-sensitive highlighting,
but this could be a good tradeoff.

-Mike

On 5/3/2014 2:39 PM, Markus Jelsma wrote:
> Hello Michael, are you not on Lucene 4.8?
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/LUCENE-5111
>
>
> Michael Sokolov <msokolov@safaribooksonline.com> wrote:
> For posterity, in case anybody follows this thread, I tracked the
> problem down to WordDelimiterFilter; apparently it creates an offset of
> -1 in some cases, which PostingsHighlighter rejects.
>
> -Mike
>
>
> On 5/2/2014 10:20 AM, Michael Sokolov wrote:
>> I checked using the analysis admin page, and I believe there are
>> offsets being generated (I assume start/end = offsets).  So I don't know; I am
>> going to try reindexing again.  Maybe I neglected to reload the config
>> before I indexed last time.
>>
>> -Mike
>>
>> On 05/02/2014 09:34 AM, Michael Sokolov wrote:
>>> I've been wanting to try out the PostingsHighlighter, so I added
>>> storeOffsetsWithPositions to my field definition, enabled the
>>> highlighter in solrconfig.xml,  reindexed and tried it out. When I
>>> issue a query I'm getting this error:
>>>
>>> field 'text' was indexed without offsets, cannot highlight
>>>
>>> java.lang.IllegalArgumentException: field 'text' was indexed without offsets, cannot highlight
>>>     at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightDoc(PostingsHighlighter.java:545)
>>>     at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightField(PostingsHighlighter.java:467)
>>>     at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFieldsAsObjects(PostingsHighlighter.java:392)
>>>     at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFields(PostingsHighlighter.java:293)
>>>
>>> I've been trying to figure out why the field wouldn't have offsets
>>> indexed, but I just can't see it.  Is there something in the analysis
>>> chain that could be stripping out offsets?
>>>
>>>
>>> This is the field definition:
>>>
>>>       <field name="text" type="text_en" indexed="true" stored="true"
>>> multiValued="false" termVectors="true" termPositions="true"
>>> termOffsets="true" storeOffsetsWithPositions="true" />
>>>
>>> (Yes I know PH doesn't require term vectors; I'm keeping them around
>>> for now while I experiment)
>>>
>>>       <fieldType name="text_en" class="solr.TextField"
>>> positionIncrementGap="100">
>>>         <analyzer type="index">
>>>           <!-- We are indexing mostly HTML so we need to ignore the
>>> tags -->
>>>           <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>           <!--<tokenizer class="solr.StandardTokenizerFactory"/>-->
>>>           <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>           <!-- lower casing must happen before WordDelimiterFilter or
>>> protwords.txt will not work -->
>>>           <filter class="solr.LowerCaseFilterFactory"/>
>>>           <filter class="solr.WordDelimiterFilterFactory"
>>> stemEnglishPossessive="1" protected="protwords.txt"/>
>>>           <!-- This deals with contractions -->
>>>           <filter class="solr.SynonymFilterFactory"
>>> synonyms="synonyms.txt" expand="true" ignoreCase="true"/>
>>>           <filter class="solr.HunspellStemFilterFactory"
>>> dictionary="en_US.dic" affix="en_US.aff" ignoreCase="true"/>
>>>           <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>         </analyzer>
>>>         <analyzer type="query">
>>>           <!--<tokenizer class="solr.StandardTokenizerFactory"/>-->
>>>           <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>           <!-- lower casing must happen before WordDelimiterFilter or
>>> protwords.txt will not work -->
>>>           <filter class="solr.LowerCaseFilterFactory"/>
>>>           <filter class="solr.WordDelimiterFilterFactory"
>>> protected="protwords.txt"/>
>>>           <!-- setting tokenSeparator="" solves issues with compound
>>> words and improves phrase search -->
>>>           <filter class="solr.HunspellStemFilterFactory"
>>> dictionary="en_US.dic" affix="en_US.aff" ignoreCase="true"/>
>>>           <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>         </analyzer>
>>>       </fieldType>
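
For readers following this thread: the diagnosis reported above is that WordDelimiterFilter produced an offset of -1 in some case. Below is a minimal sketch, written in Python rather than against Lucene's Java API, of how one might scan token offsets (for example, values copied by hand from the analysis admin page) for such entries. The `Token` type, the `find_bad_offsets` helper, and the sample data are all hypothetical, and the rule used here (0 <= start <= end) is an assumption for illustration, not Lucene's own validation logic.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    """One analyzed token: its text plus character offsets into the source."""
    term: str
    start: int  # start offset, inclusive
    end: int    # end offset, exclusive


def find_bad_offsets(tokens: Iterable[Token]) -> List[Token]:
    """Return every token whose offsets are negative or inverted.

    Assumption for this sketch: an offset pair is usable for highlighting
    only if 0 <= start <= end. This is not Lucene's own validation code.
    """
    return [t for t in tokens if t.start < 0 or t.end < t.start]


# Hypothetical analysis output for the input "wi-fi router"; the -1 offsets
# on the second token are made up to mimic the symptom described in the thread.
sample = [
    Token("wi", 0, 2),
    Token("fi", -1, -1),
    Token("router", 6, 12),
]

for bad in find_bad_offsets(sample):
    print(f"bad offsets for {bad.term!r}: start={bad.start}, end={bad.end}")
```

With this sample, only the "fi" token is flagged. A check like this only narrows down which token carries the bad offsets; finding the filter responsible still means stepping through the analysis chain one stage at a time.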

