lucene-solr-user mailing list archives

From Christian Zambrano <czamb...@gmail.com>
Subject Re: Weird Facet and KeywordTokenizerFactory Issue
Date Tue, 06 Oct 2009 20:52:46 GMT
And did you have the analyzer for that field set up the same way as shown in
your previous e-mail when you indexed the data?
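
One way to look at the terms that are actually in the index, whatever the analyzer is set to now, is to ask for facet values restricted to a prefix. Below is a minimal Python sketch that only builds the request URL. It assumes the select URL quoted further down and the `keyword` field from the quoted schema; the helper name `facet_prefix_url` is hypothetical. The parameters are standard Solr faceting parameters.

```python
from urllib.parse import urlencode

# Assumption: base select URL as quoted in this thread; adjust for your own core.
BASE_URL = "http://localhost:8080/solr-admin/topicscore/select/"


def facet_prefix_url(field, prefix, base_url=BASE_URL):
    """Build a facet-only request for indexed terms of `field` starting with `prefix`."""
    params = [
        ("q", "*:*"),             # match every document
        ("rows", "0"),            # no documents needed, only facet counts
        ("facet", "true"),
        ("facet.field", field),
        ("facet.prefix", prefix),  # only terms beginning with this string
        ("facet.mincount", "1"),   # skip terms that no document contains
        ("facet.limit", "-1"),     # no cap on the number of values returned
    ]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(facet_prefix_url("keyword", "New"))
```

If a bare "New" comes back with a non-zero count, the term is really in the index, which would suggest the data was indexed under a different analysis chain or with different stopword/synonym files.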



On 10/06/2009 03:46 PM, Ravi Kiran wrote:
> I did in fact check it out, and there is no weirdness on the analysis page. See
> the result below.
>
> Index Analyzer
>
> org.apache.solr.analysis.KeywordTokenizerFactory {}
>   term position: 1 | term text: New York | term type: word | source start,end: 0,8 | payload:
> org.apache.solr.analysis.TrimFilterFactory {}
>   term position: 1 | term text: New York | term type: word | source start,end: 0,8 | payload:
> org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
>   term position: 1 | term text: New York | term type: word | source start,end: 0,8 | payload:
> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
>   term position: 1 | term text: New York | term type: word | source start,end: 0,8 | payload:
> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
>   term position: 1 | term text: New York | term type: word | source start,end: 0,8 | payload:
>
> Query Analyzer
>
> org.apache.solr.analysis.KeywordTokenizerFactory {}
>   term position: 1 | term text: New York | term type: word | source start,end: 0,8 | payload:
> org.apache.solr.analysis.TrimFilterFactory {}
>   term position: 1 | term text: New York | term type: word | source start,end: 0,8 | payload:
> org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
>   term position: 1 | term text: New York | term type: word | source start,end: 0,8 | payload:
> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
>   term position: 1 | term text: New York | term type: word | source start,end: 0,8 | payload:
> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
>   term position: 1 | term text: New York | term type: word | source start,end: 0,8 | payload:
>
>
> On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano <czambran@gmail.com> wrote:
>
>    
>> Have you tried using the Analysis page to see what tokens are generated for
>> the string "New York"? It could be that one of the token filters is adding the
>> token 'new' for all strings that start with 'new'.
>>
>>
>> On 10/06/2009 02:54 PM, Ravi Kiran wrote:
>>
>>      
>>> Hello All,
>>> I am getting some ghost facets in Solr 1.4. Can anybody kindly help me
>>> understand why I get them and how to eliminate them? My schema.xml snippet
>>> is given at the end. I am indexing named entities extracted via OpenNLP into
>>> Solr. My understanding of KeywordTokenizerFactory is that it treats the
>>> entire field value as a single token, am I right? For example, "New York"
>>> will be indexed as 'New York' and will not be split, right? However, I see
>>> the values split up in the facets as follows when running the query
>>>
>>> http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1
>>>
>>> but when I search with the standard handler, qt=standard&q=keyword:"New", I
>>> don't find any doc which has just "New". After digging in a bit, I found that
>>> if several keywords have a common starting word, that word is being pulled
>>> out as another facet, like the following. Any help is greatly appreciated.
>>>
>>> Result
>>> ------------
>>> <int name="New">47</int>      -------->   Ghost
>>> <int name="New Hampshire">7</int>
>>> <int name="New Jersey">16</int>
>>> <int name="New Orleans">10</int>
>>> <int name="New York">147</int>
>>> <int name="New York City">23</int>
>>> <int name="New York Giants">8</int>
>>> <int name="New York Islanders">5</int>
>>> <int name="New York Mercantile Exchange">6</int>
>>> <int name="New York Mets">8</int>
>>> <int name="New York Stock Exchange">10</int>
>>> <int name="New York Times">8</int>
>>> <int name="New York University">5</int>
>>> <int name="New Zealand">7</int>
>>>
>>> <int name="Energy">7</int>      -------------->   Ghost
>>> <int name="Energy Department">5</int>
>>> <int name="Energy Information Administration">5</int>
>>>
>>>
>>> <int name="Federal">7</int>    -------------->   Ghost
>>> <int name="Federal Deposit Insurance Corp.">6</int>
>>> <int name="Federal Reserve">26</int>
>>> <int name="Federal Reserve Chairman">6</int>
>>>
>>> <int name="North">27</int>
>>> <int name="North Carolina">8</int>
>>> <int name="North Dakota">7</int>
>>> <int name="North Korea">12</int>
>>>
>>> Schema.xml
>>> -----------------
>>>
>>>      <fieldType name="keywordText" class="solr.TextField"
>>>                 sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
>>>        <analyzer type="index">
>>>          <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>          <filter class="solr.TrimFilterFactory" />
>>>          <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>                  words="stopwords.txt,entity-stopwords.txt"
>>>                  enablePositionIncrements="true"/>
>>>          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>                  ignoreCase="true" expand="false" />
>>>          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>        </analyzer>
>>>        <analyzer type="query">
>>>          <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>          <filter class="solr.TrimFilterFactory" />
>>>          <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>                  words="stopwords.txt,entity-stopwords.txt"
>>>                  enablePositionIncrements="true" />
>>>          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>                  ignoreCase="true" expand="false" />
>>>          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>        </analyzer>
>>>      </fieldType>
>>>
>>>      <field name="person" type="keywordText" indexed="true" stored="true"
>>>             multiValued="true" termVectors="false" termPositions="false"
>>>             termOffsets="false"/>
>>>      <field name="organization" type="keywordText" indexed="true" stored="true"
>>>             multiValued="true" termVectors="false" termPositions="false"
>>>             termOffsets="false"/>
>>>      <field name="location" type="keywordText" indexed="true" stored="true"
>>>             multiValued="true" termVectors="false" termPositions="false"
>>>             termOffsets="false"/>
>>>      <field name="keyword" type="keywordText" indexed="true" stored="true"
>>>             multiValued="true" termVectors="false" termPositions="false"
>>>             termOffsets="false"/>
>>>
>>>
>>>
>>>        
>>      
>    
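
The pattern described in the original message (a one-word facet value that is also the first word of several longer values) can be checked mechanically. Here is a rough Python sketch using a handful of the facet counts from the result above. The `<lst name="keyword">` wrapper is an assumption added so the snippet parses as one document, and the function name `ghost_candidates` is hypothetical.

```python
import xml.etree.ElementTree as ET

# A subset of the facet counts from the result above. The <lst> wrapper is an
# assumption, added only so the fragment is a single well-formed XML document.
FACET_XML = """
<lst name="keyword">
  <int name="New">47</int>
  <int name="New Hampshire">7</int>
  <int name="New York">147</int>
  <int name="New York City">23</int>
  <int name="Energy">7</int>
  <int name="Energy Department">5</int>
  <int name="Federal">7</int>
  <int name="Federal Reserve">26</int>
  <int name="North">27</int>
  <int name="North Carolina">8</int>
</lst>
"""


def ghost_candidates(facet_xml):
    """Return one-word facet values that are also the first word of a longer value."""
    names = [el.get("name") for el in ET.fromstring(facet_xml).iter("int")]
    # First words of every multi-word facet value.
    first_words = {n.split()[0] for n in names if len(n.split()) > 1}
    return [n for n in names if len(n.split()) == 1 and n in first_words]


if __name__ == "__main__":
    print(ghost_candidates(FACET_XML))
```

On this sample it flags "New", "Energy" and "Federal", and also "North", which the original message does not mark as a ghost. The check only surfaces candidates: it cannot tell a spurious value from a one-word entity that really occurs in the data.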
