lucene-solr-user mailing list archives

From Jeffrey Schmidt <jeff_schm...@mac.com>
Subject Re: FastVectorHighlighter -> no highlights
Date Mon, 23 Apr 2012 19:26:54 GMT
This does not appear to be shingle-specific.  A non-shingled field also goes unhighlighted
in the same way with FVH.  I can see in the timing information that FVH takes much longer
to run than no highlighting at all, so Solr must be doing something.  But why it lists just
the document IDs with few or no field highlights is still a mystery.
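
A rough sketch of how the two runs can be compared side by side, in Python. It builds the
same edismax request with and without hl.useFastVectorHighlighter and reports which fields
the standard highlighter returned that FVH dropped. The base URL, the helper names and the
use of wt=json are assumptions for illustration only; missing_highlights expects the
"highlighting" section of each parsed JSON response.

```python
from urllib.parse import urlencode


def highlight_query(base_url, q, fields, use_fvh, rows=2):
    """Build a select URL that differs only in the highlighter being used.

    base_url is hypothetical, e.g. "http://localhost:8983/solr/select".
    """
    params = {
        "q": q,
        "defType": "edismax",
        "rows": rows,
        "hl": "true",
        "hl.fl": ",".join(fields),
        "hl.useFastVectorHighlighter": "true" if use_fvh else "false",
        "debug": "timing",  # same timing output mentioned in this thread
        "wt": "json",       # assumption: JSON is easier to diff than XML
    }
    return base_url + "?" + urlencode(params)


def missing_highlights(standard, fvh):
    """Return {doc_id: [fields highlighted by standard but not by FVH]}.

    Both arguments are the "highlighting" maps of two responses:
    {doc_id: {field_name: [snippet, ...]}}.
    """
    missing = {}
    for doc_id, fields in standard.items():
        lost = sorted(set(fields) - set(fvh.get(doc_id, {})))
        if lost:
            missing[doc_id] = lost
    return missing
```

Fed the two responses quoted below, it would report n_macromolecule_name and
n_protein_family for ING:3lzx, and every field except n_macromolecule_summary for ING:8lj.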

Any ideas on where I should look in the configuration, or which parameters to try?

Cheers,

Jeff

On Apr 19, 2012, at 7:51 AM, Jeff Schmidt wrote:

> I am using Solr 4.0, and debug=timing shows Solr spending the great majority of its time
> in the HighlightComponent. It seemed logical to look into the FastVectorHighlighter.  It does
> seem much faster, but on the other hand, I'm not getting the highlights I need. :)
> 
> I've seen references to FVH not supporting MultiTerm and (non-fixed sized) ngrams.  I'm
> using edismax, and I don't know if a certain configuration of that becomes multi term and
> that's my problem, or if there is something completely different. I don't have ngrams, but I
> do shingle.  For the examples below, I have these fields defined:
> 
>       <field name="n_macromolecule_name" type="text_lc_np_shingle" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>       <field name="n_protein_family" type="text_lc_np_shingle" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>       <field name="n_pathway_name" type="text_lc_np_shingle" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>       <field name="n_cellreg_regulated_by" type="text_lc_np_shingle" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>       <field name="n_cellreg_disease" type="text_lc_np_shingle" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>       <field name="n_macromolecule_summary" type="text_lc_np_shingle" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
> 
> 
> Note that all are both indexed and stored, multi-valued, and I have termVectors="true"
> termPositions="true" termOffsets="true" to enable FVH. When I had missed that in a field,
> I could see the log indicating such and reverting to the regular highlighter. I no longer
> see those messages.  All of the above fields are of this type:
> 
>        <!-- A text field that forces lowercase, removes punctuation and generates shingles for phrase matching -->
>        <fieldType name="text_lc_np_shingle" class="solr.TextField" positionIncrementGap="100">
>          <analyzer type="index">
>            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>            <!-- strip punctuation -->
>            <filter class="solr.PatternReplaceFilterFactory"
>                pattern="([\p{Punct}])" replacement="" replace="all"/>
>            <!-- Remove any 0-length tokens. -->
>            <filter class="solr.LengthFilterFactory" min="1" max="100"/>
>            <filter class="solr.LowerCaseFilterFactory"/>
>            <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true"/>
>          </analyzer>
>          <analyzer type="query">
>            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>            <!-- strip punctuation -->
>            <filter class="solr.PatternReplaceFilterFactory"
>                pattern="([\p{Punct}])" replacement="" replace="all"/>
>            <!-- Remove any 0-length tokens. -->
>            <filter class="solr.LengthFilterFactory" min="1" max="100"/>
>            <filter class="solr.LowerCaseFilterFactory"/>
>            <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="false" outputUnigramsIfNoShingles="true"/>
>          </analyzer>
>        </fieldType>
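
To make the difference between the index and query analyzers above concrete, here is a rough
Python approximation of the terms each chain would produce. It is only a sketch: the
SynonymFilterFactory step is left out, positions and offsets are not modelled, and the
function names are made up for illustration. It is not how Lucene implements these filters.

```python
import re
import string

# Rough stand-in for the ([\p{Punct}]) pattern: ASCII punctuation only.
PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")


def base_tokens(text):
    """Whitespace-split, strip punctuation, drop empty or over-long tokens, lowercase."""
    tokens = []
    for raw in text.split():
        token = PUNCT.sub("", raw).lower()
        if 1 <= len(token) <= 100:
            tokens.append(token)
    return tokens


def shingles(tokens, max_size=4, output_unigrams=True, unigrams_if_no_shingles=False):
    """Approximate ShingleFilter output as a flat list of terms (min shingle size 2)."""
    out = []
    for start in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[start])
        for size in range(2, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    if not out and unigrams_if_no_shingles:
        out = list(tokens)
    return out


def index_terms(text):
    # Index analyzer: outputUnigrams="true"
    return shingles(base_tokens(text), output_unigrams=True)


def query_terms(text):
    # Query analyzer: outputUnigrams="false" outputUnigramsIfNoShingles="true"
    return shingles(base_tokens(text), output_unigrams=False,
                    unigrams_if_no_shingles=True)
```

Under this approximation, the one-word query cancer stays a unigram because of
outputUnigramsIfNoShingles, and that unigram is also present in the index, whereas a
two-word query such as breast cancer is sent only as the shingle "breast cancer".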
> 
> 
> Using the standard highlight component, for the search term cancer (rows=2), I get the
> highlights I've come to appreciate:
> 
>     <lst name="highlighting">
>         <lst name="ING:3lzx">
>             <arr name="n_macromolecule_name">
>                 <str>&lt;span class="ingReasonText"&gt;cancer&lt;/span&gt; susceptibility candidate 1</str>
>             </arr>
>             <arr name="n_protein_family">
>                 <str>&lt;span class="ingReasonText"&gt;Cancer&lt;/span&gt; susceptibility candidate 1</str>
>             </arr>
>         </lst>
>         <lst name="ING:8lj">
>             <arr name="n_macromolecule_name">
>                 <str>breast &lt;span class="ingReasonText"&gt;cancer&lt;/span&gt; 2, early onset</str>
>             </arr>
>             <arr name="n_pathway_name">
>                 <str>Hereditary Breast &lt;span class="ingReasonText"&gt;Cancer&lt;/span&gt; Signaling</str>
>             </arr>
>             <arr name="n_cellreg_regulated_by">
>                 <str>prostate &lt;span class="ingReasonText"&gt;cancer&lt;/span&gt; cells</str>
>             </arr>
>             <arr name="n_cellreg_disease">
>                 <str>breast &lt;span class="ingReasonText"&gt;cancer&lt;/span&gt;</str>
>             </arr>
>             <arr name="n_macromolecule_summary">
>                 <str> mutations in BRCA1 and this gene, BRCA2, confer increased lifetime risk of developing breast or ovarian &lt;span class="ingReasonText"&gt;cancer.&lt;/span&gt;</str>
>             </arr>
>         </lst>
>     </lst>
> 
> With everything else being the same, when I set hl.useFastVectorHighlighter=true I get:
> 
>     <lst name="highlighting">
>         <lst name="ING:3lzx"/>
>         <lst name="ING:8lj">
>             <arr name="n_macromolecule_summary">
>                 <str>breast or &lt;span class="ingReasonText"&gt;ovarian&lt;/span&gt; cancer. Both BRCA1 and BRCA2 are involved in maintenance of genome stability, specifically</str>
>             </arr>
>         </lst>
>     </lst>
> 
> Note that the same fields simply do not appear, except for n_macromolecule_summary, in
> which case it's for some reason highlighting "ovarian" instead of "cancer".
> 
> Highlight related configuration is in the edismax request handler:
> 
>      <str name="hl.requireFieldMatch">true</str>
>      <str name="hl.usePhraseHighlighter">true</str>
>      <str name="hl.phraseLimit">5000</str>
>      <str name="hl.fragListBuilder">simple</str>
>      <str name="hl.fragmentsBuilder">colored</str>
>      <str name="hl.simple.pre"><![CDATA[<span class="ingReasonText">]]></str>
>      <str name="hl.simple.post"><![CDATA[</span>]]></str>
>      <str name="hl.tag.pre"><![CDATA[<span class="ingReasonText">]]></str>
>      <str name="hl.tag.post"><![CDATA[</span>]]></str>
>      
>      <!-- for this field, we want no fragmenting, just highlighting -->
>      <str name="f.name.hl.fragsize">0</str>
>      <!-- instructs Solr to return the field itself if no query terms are
>           found
>      <str name="f.name.hl.alternateField">name</str> -->
>      <str name="f.text.hl.fragmenter">regex</str> <!-- defined below -->
> 
> Any ideas on what I'm doing wrong?  Sorry for the long email, but I'm trying to answer
> as many anticipated configuration questions as I can. Is there a problem with FVH and shingling?
> Hopefully it's something else?
> 
> Thanks,
> 
> Jeff
> --
> Jeff Schmidt
> 535 Consulting
> jas@535consulting.com
> http://www.535consulting.com
> (650) 423-1068

--
Jeff Schmidt
jeff_schmidt@mac.com

