lucene-solr-user mailing list archives

From Christian Zambrano <czamb...@gmail.com>
Subject Re: Weird Facet and KeywordTokenizerFactory Issue
Date Tue, 06 Oct 2009 21:45:59 GMT
I am stumped then. I had a similar issue when I was using a field that
was being heavily tokenized, but I corrected it by using a field
(generated using copyField) that doesn't get analyzed at all.
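
A minimal sketch of that kind of setup, in case it helps. It assumes the
stock "string" field type (solr.StrField) is available in the schema; the
field name location_facet is hypothetical and is not taken from the
schema quoted further down.

```xml
<!-- Assumption: an unanalyzed string type, as shipped in the example schema -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

<!-- Hypothetical facet-only field: indexed verbatim, never tokenized or filtered -->
<field name="location_facet" type="string" indexed="true" stored="false" multiValued="true"/>

<!-- Copy the raw input of the analyzed field into the unanalyzed one at index time -->
<copyField source="location" dest="location_facet"/>
```

With something like this in place, faceting would target the copy
(facet.field=location_facet) while searches keep using the analyzed field.
The data has to be reindexed for the new field to be populated.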

In the query you provided earlier, I didn't see the parameter that tells
Solr which field to produce facets for.

Something like:

http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1&facet.field=location



On 10/06/2009 04:09 PM, Ravi Kiran wrote:
> Yes, exactly the same.
>
> On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambrano <czambran@gmail.com> wrote:
>
>> And you had the analyzer for that field set up the same way as shown in
>> your previous e-mail when you indexed the data?
>>
>> On 10/06/2009 03:46 PM, Ravi Kiran wrote:
>>
>>> I did in fact check it out, and there is no weirdness on the analysis
>>> page. See the result below.
>>>
>>> Index Analyzer
>>>
>>> org.apache.solr.analysis.KeywordTokenizerFactory {}
>>>   term position 1 | term text New York | term type word | source start,end 0,8 | payload
>>> org.apache.solr.analysis.TrimFilterFactory {}
>>>   term position 1 | term text New York | term type word | source start,end 0,8 | payload
>>> org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
>>>   term position 1 | term text New York | term type word | source start,end 0,8 | payload
>>> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
>>>   term position 1 | term text New York | term type word | source start,end 0,8 | payload
>>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
>>>   term position 1 | term text New York | term type word | source start,end 0,8 | payload
>>>
>>> Query Analyzer
>>>
>>> org.apache.solr.analysis.KeywordTokenizerFactory {}
>>>   term position 1 | term text New York | term type word | source start,end 0,8 | payload
>>> org.apache.solr.analysis.TrimFilterFactory {}
>>>   term position 1 | term text New York | term type word | source start,end 0,8 | payload
>>> org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
>>>   term position 1 | term text New York | term type word | source start,end 0,8 | payload
>>> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
>>>   term position 1 | term text New York | term type word | source start,end 0,8 | payload
>>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
>>>   term position 1 | term text New York | term type word | source start,end 0,8 | payload
>>>
>>>
>>> On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano <czambran@gmail.com> wrote:
>>>
>>>> Have you tried using the Analysis page to see what tokens are generated
>>>> for the string "New York"? It could be that one of the token filters is
>>>> adding the token 'new' for all strings that start with 'new'.
>>>>
>>>> On 10/06/2009 02:54 PM, Ravi Kiran wrote:
>>>>
>>>>> Hello All,
>>>>> I am getting some ghost facets in Solr 1.4. Can anybody kindly help me
>>>>> understand why I get them and how to eliminate them? My schema.xml
>>>>> snippet is given at the end. I am indexing named entities extracted via
>>>>> OpenNLP into Solr. My understanding of KeywordTokenizerFactory is that
>>>>> it will treat the whole input as a single token, am I right? For example,
>>>>> "New York" will be indexed as 'New York' and will not be split, right?
>>>>> However, I see them split up in facets as follows when running the query
>>>>>
>>>>> http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1
>>>>>
>>>>> but when I search with the standard handler, qt=standard&q=keyword:"New",
>>>>> I don't find any doc which has just "New". After digging in a bit I found
>>>>> that if several keywords have a common starting word, it is being pulled
>>>>> out as another facet, like the following. Any help is greatly appreciated.
>>>>>
>>>>> Result
>>>>> ------------
>>>>> <int name="New">47</int>       -------->    Ghost
>>>>> <int name="New Hampshire">7</int>
>>>>> <int name="New Jersey">16</int>
>>>>> <int name="New Orleans">10</int>
>>>>> <int name="New York">147</int>
>>>>> <int name="New York City">23</int>
>>>>> <int name="New York Giants">8</int>
>>>>> <int name="New York Islanders">5</int>
>>>>> <int name="New York Mercantile Exchange">6</int>
>>>>> <int name="New York Mets">8</int>
>>>>> <int name="New York Stock Exchange">10</int>
>>>>> <int name="New York Times">8</int>
>>>>> <int name="New York University">5</int>
>>>>> <int name="New Zealand">7</int>
>>>>>
>>>>> <int name="Energy">7</int>       -------------->    Ghost
>>>>> <int name="Energy Department">5</int>
>>>>> <int name="Energy Information Administration">5</int>
>>>>>
>>>>>
>>>>> <int name="Federal">7</int>     -------------->    Ghost
>>>>> <int name="Federal Deposit Insurance Corp.">6</int>
>>>>> <int name="Federal Reserve">26</int>
>>>>> <int name="Federal Reserve Chairman">6</int>
>>>>>
>>>>> <int name="North">27</int>
>>>>> <int name="North Carolina">8</int>
>>>>> <int name="North Dakota">7</int>
>>>>> <int name="North Korea">12</int>
>>>>>
>>>>> Schema.xml
>>>>> -----------------
>>>>>
>>>>>      <fieldType name="keywordText" class="solr.TextField"
>>>>>                 sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
>>>>>        <analyzer type="index">
>>>>>          <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>>>          <filter class="solr.TrimFilterFactory"/>
>>>>>          <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>                  words="stopwords.txt,entity-stopwords.txt"
>>>>>                  enablePositionIncrements="true"/>
>>>>>          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>>                  ignoreCase="true" expand="false"/>
>>>>>          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>>        </analyzer>
>>>>>        <analyzer type="query">
>>>>>          <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>>>          <filter class="solr.TrimFilterFactory"/>
>>>>>          <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>                  words="stopwords.txt,entity-stopwords.txt"
>>>>>                  enablePositionIncrements="true"/>
>>>>>          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>>                  ignoreCase="true" expand="false"/>
>>>>>          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>>        </analyzer>
>>>>>      </fieldType>
>>>>>
>>>>>      <field name="person" type="keywordText" indexed="true" stored="true"
>>>>>             multiValued="true" termVectors="false" termPositions="false"
>>>>>             termOffsets="false"/>
>>>>>      <field name="organization" type="keywordText" indexed="true" stored="true"
>>>>>             multiValued="true" termVectors="false" termPositions="false"
>>>>>             termOffsets="false"/>
>>>>>      <field name="location" type="keywordText" indexed="true" stored="true"
>>>>>             multiValued="true" termVectors="false" termPositions="false"
>>>>>             termOffsets="false"/>
>>>>>      <field name="keyword" type="keywordText" indexed="true" stored="true"
>>>>>             multiValued="true" termVectors="false" termPositions="false"
>>>>>             termOffsets="false"/>