lucene-solr-user mailing list archives

From Christian Zambrano <czamb...@gmail.com>
Subject Re: Weird Facet and KeywordTokenizerFactory Issue
Date Tue, 06 Oct 2009 22:02:41 GMT
Got it. Sorry for not having an answer for your problem.

On 10/06/2009 04:58 PM, Ravi Kiran wrote:
> You don't see any facet fields in my query because I have configured them in
> solrconfig.xml as default facet fields for the dismax and standard handlers,
> so that I don't have to specify all those fields individually every time I
> query. All I need to do is set facet=true.
>
>    <requestHandler name="dismax" class="solr.SearchHandler" default="true">
>      <lst name="defaults">
>       <str name="defType">dismax</str>
>       <str name="echoParams">explicit</str>
>       <float name="tie">0.01</float>
>       <str name="qf">
>          systemid^20.0 headline^20.0 keyword^18.0 person^18.0
>          organization^18.0 usstate^18.0 country^18.0 subject^18.0 quote^18.0
>          blurb^15.0 articlesubhead^8.0 byline^7.0 articleblurb^2.0 body^1.5
>          multimediablurb^1.5
>       </str>
>       <str name="pf">
>          headline^20.5 keyword^18.5 person^18.5 organization^18.5
>          usstate^18.5 country^18.5 subject^18.5 quote^18.5 blurb^15.5
>          articlesubhead^8.5 byline^7.5 articleblurb^2.5 body^2.0 multimediablurb^2.0
>       </str>
>       <str name="bf">
>          recip(rord(pubdatetime),1,1000,1000)^1.0
>       </str>
>       <str name="fl">
>          *
>       </str>
>       <str name="mm">
>          2&lt;-1 5&lt;-3 6&lt;90%
>       </str>
>       <int name="ps">100</int>
>       <str name="q.alt">*:*</str>
>       <!-- example highlighter config, enable per-query with hl=true -->
>       <str name="hl.fl">keyword</str>
>       <!-- for this field, we want no fragmenting, just highlighting -->
>       <str name="f.body.hl.fragsize">0</str>
>       <!-- instructs Solr to return the field itself if no query terms are found -->
>       <str name="f.name.hl.alternateField">keyword</str>
>       <str name="f.text.hl.fragmenter">regex</str>  <!-- defined below -->
>       <str name="facet">false</str>
>       <int name="facet.mincount">1</int>
>       <int name="f.keyword.facet.mincount">5</int>
>       <int name="f.keywordlower.facet.mincount">5</int>
>       <int name="f.keywordformatted.facet.mincount">5</int>
>       <int name="f.person.facet.mincount">5</int>
>       <int name="f.personformatted.facet.mincount">5</int>
>       <int name="f.organization.facet.mincount">5</int>
>       <str name="facet.field">contenttype</str>
>       <str name="facet.field">keyword</str>
>       <str name="facet.field">keywordlower</str>
>       <str name="facet.field">keywordformatted</str>
>       <str name="facet.field">person</str>
>       <str name="facet.field">personformatted</str>
>       <str name="facet.field">organization</str>
>       <str name="facet.field">usstate</str>
>       <str name="facet.field">country</str>
>       <str name="facet.field">subject</str>
>      </lst>
>    </requestHandler>
>
>
> On Tue, Oct 6, 2009 at 5:45 PM, Christian Zambrano <czambran@gmail.com> wrote:
>
>    
>> I am stumped then. I had a similar issue when I was using a field that was
>> being heavily tokenized, but I corrected it by faceting on a separate field
>> (generated using copyField) that doesn't get analyzed at all.
>>
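The copyField approach mentioned above could look roughly like the following schema.xml fragment. This is a minimal sketch, not the configuration either poster actually used: the field name keyword_facet is hypothetical, and the "string" type is assumed to be the solr.StrField type from the stock example schema, which indexes each value as a single unanalyzed term.

```xml
<!-- Assumed: the "string" type from the stock Solr example schema (no analysis chain) -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

<!-- Hypothetical facet-only copy of the analyzed "keyword" field -->
<field name="keyword_facet" type="string" indexed="true" stored="false" multiValued="true"/>

<!-- Copies the raw input value of "keyword" into "keyword_facet" at index time -->
<copyField source="keyword" dest="keyword_facet"/>
```

Faceting would then target the copy (for example facet.field=keyword_facet) while searches keep using keyword. The new field only has values after the documents are reindexed.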
>> In the query you provided earlier, I didn't see the parameters that tell
>> Solr which fields it should produce facets for.
>>
>> Something like:
>>
>>
>> http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1&facet.field=location
>>
>>
>>
>>
>> On 10/06/2009 04:09 PM, Ravi Kiran wrote:
>>
>>      
>>> Yes Exactly the same
>>>
>>> On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambrano <czambran@gmail.com> wrote:
>>>
>>>> And when you indexed the data, was the analyzer for that field set up
>>>> the same way as shown in your previous e-mail?
>>>>
>>>>
>>>>
>>>>
>>>> On 10/06/2009 03:46 PM, Ravi Kiran wrote:
>>>>
>>>>
>>>>
>>>>          
>>>>> I did in fact check it out, and there is no weirdness on the analysis
>>>>> page. See the result below.
>>>>>
>>>>> Index Analyzer
>>>>>
>>>>> org.apache.solr.analysis.KeywordTokenizerFactory {}
>>>>>   term position: 1, term text: New York, term type: word, source start,end: 0,8, payload: (empty)
>>>>> org.apache.solr.analysis.TrimFilterFactory {}
>>>>>   term position: 1, term text: New York, term type: word, source start,end: 0,8, payload: (empty)
>>>>> org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
>>>>>   term position: 1, term text: New York, term type: word, source start,end: 0,8, payload: (empty)
>>>>> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
>>>>>   term position: 1, term text: New York, term type: word, source start,end: 0,8, payload: (empty)
>>>>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
>>>>>   term position: 1, term text: New York, term type: word, source start,end: 0,8, payload: (empty)
>>>>>
>>>>> Query Analyzer
>>>>>
>>>>> org.apache.solr.analysis.KeywordTokenizerFactory {}
>>>>>   term position: 1, term text: New York, term type: word, source start,end: 0,8, payload: (empty)
>>>>> org.apache.solr.analysis.TrimFilterFactory {}
>>>>>   term position: 1, term text: New York, term type: word, source start,end: 0,8, payload: (empty)
>>>>> org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
>>>>>   term position: 1, term text: New York, term type: word, source start,end: 0,8, payload: (empty)
>>>>> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
>>>>>   term position: 1, term text: New York, term type: word, source start,end: 0,8, payload: (empty)
>>>>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
>>>>>   term position: 1, term text: New York, term type: word, source start,end: 0,8, payload: (empty)
>>>>>
>>>>>
>>>>> On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano <czambran@gmail.com> wrote:
>>>>>
>>>>>> Have you tried using the Analysis page to see what tokens are generated
>>>>>> for the string "New York"? It could be that one of the token filters is
>>>>>> adding the token 'new' for all strings that start with 'new'.
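
One way to test that hypothesis would be to index the same values into a field type that keeps the tokenizer but leaves out the stop and synonym filters. The fragment below is only a sketch under assumptions: the type name keywordTextPlain and the field name keyword_plain are hypothetical and do not appear in the original schema.

```xml
<!-- Hypothetical diagnostic type: same tokenizer as keywordText, no stop or synonym filters -->
<fieldType name="keywordTextPlain" class="solr.TextField"
           sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field populated from "keyword" so both can be faceted side by side -->
<field name="keyword_plain" type="keywordTextPlain" indexed="true" stored="false" multiValued="true"/>
<copyField source="keyword" dest="keyword_plain"/>
```

If the bare "New" facet value no longer shows up on keyword_plain after reindexing, the extra term is presumably coming from one of the removed filters, and re-adding them one at a time would narrow it down.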
>>>>>>
>>>>>>
>>>>>> On 10/06/2009 02:54 PM, Ravi Kiran wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>              
>>>>>>> Hello All,
>>>>>>>
>>>>>>> I am getting some ghost facets in Solr 1.4. Can anybody kindly help me
>>>>>>> understand why I get them and how to eliminate them? My schema.xml
>>>>>>> snippet is given at the end. I am indexing named entities extracted via
>>>>>>> OpenNLP into Solr. My understanding of KeywordTokenizerFactory is that
>>>>>>> it treats the entire input as a single token, am I right? For example,
>>>>>>> "New York" will be indexed as 'New York' and will not be split, right?
>>>>>>> However, I see them split up in the facets as follows when running the
>>>>>>> query
>>>>>>>
>>>>>>> http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1
>>>>>>>
>>>>>>> but when I search with the standard handler (qt=standard&q=keyword:"New")
>>>>>>> I don't find any doc which has just "New". After digging in a bit, I
>>>>>>> found that if several keywords share a common starting word, that word
>>>>>>> is pulled out as another facet, as in the following. Any help is greatly
>>>>>>> appreciated.
>>>>>>> Result
>>>>>>> ------------
>>>>>>> <int name="New">47</int>        -------->    Ghost
>>>>>>> <int name="New Hampshire">7</int>
>>>>>>> <int name="New Jersey">16</int>
>>>>>>> <int name="New Orleans">10</int>
>>>>>>> <int name="New York">147</int>
>>>>>>> <int name="New York City">23</int>
>>>>>>> <int name="New York Giants">8</int>
>>>>>>> <int name="New York Islanders">5</int>
>>>>>>> <int name="New York Mercantile Exchange">6</int>
>>>>>>> <int name="New York Mets">8</int>
>>>>>>> <int name="New York Stock Exchange">10</int>
>>>>>>> <int name="New York Times">8</int>
>>>>>>> <int name="New York University">5</int>
>>>>>>> <int name="New Zealand">7</int>
>>>>>>>
>>>>>>> <int name="Energy">7</int>        -------------->    Ghost
>>>>>>> <int name="Energy Department">5</int>
>>>>>>> <int name="Energy Information Administration">5</int>
>>>>>>>
>>>>>>>
>>>>>>> <int name="Federal">7</int>      -------------->    Ghost
>>>>>>> <int name="Federal Deposit Insurance Corp.">6</int>
>>>>>>> <int name="Federal Reserve">26</int>
>>>>>>> <int name="Federal Reserve Chairman">6</int>
>>>>>>>
>>>>>>> <int name="North">27</int>
>>>>>>> <int name="North Carolina">8</int>
>>>>>>> <int name="North Dakota">7</int>
>>>>>>> <int name="North Korea">12</int>
>>>>>>>
>>>>>>> Schema.xml
>>>>>>> -----------------
>>>>>>>
>>>>>>>      <fieldType name="keywordText" class="solr.TextField"
>>>>>>>                 sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
>>>>>>>        <analyzer type="index">
>>>>>>>          <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>>>>>          <filter class="solr.TrimFilterFactory"/>
>>>>>>>          <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>                  words="stopwords.txt,entity-stopwords.txt"
>>>>>>>                  enablePositionIncrements="true"/>
>>>>>>>          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>>>>                  ignoreCase="true" expand="false"/>
>>>>>>>          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>>>>        </analyzer>
>>>>>>>        <analyzer type="query">
>>>>>>>          <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>>>>>          <filter class="solr.TrimFilterFactory"/>
>>>>>>>          <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>                  words="stopwords.txt,entity-stopwords.txt"
>>>>>>>                  enablePositionIncrements="true"/>
>>>>>>>          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>>>>                  ignoreCase="true" expand="false"/>
>>>>>>>          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>>>>        </analyzer>
>>>>>>>      </fieldType>
>>>>>>>
>>>>>>>      <field name="person" type="keywordText" indexed="true" stored="true"
>>>>>>>             multiValued="true" termVectors="false" termPositions="false"
>>>>>>>             termOffsets="false"/>
>>>>>>>      <field name="organization" type="keywordText" indexed="true" stored="true"
>>>>>>>             multiValued="true" termVectors="false" termPositions="false"
>>>>>>>             termOffsets="false"/>
>>>>>>>      <field name="location" type="keywordText" indexed="true" stored="true"
>>>>>>>             multiValued="true" termVectors="false" termPositions="false"
>>>>>>>             termOffsets="false"/>
>>>>>>>      <field name="keyword" type="keywordText" indexed="true" stored="true"
>>>>>>>             multiValued="true" termVectors="false" termPositions="false"
>>>>>>>             termOffsets="false"/>
