lucene-solr-user mailing list archives

From Ravi Kiran <ravi.bhas...@gmail.com>
Subject Re: Weird Facet and KeywordTokenizerFactory Issue
Date Tue, 06 Oct 2009 20:46:06 GMT
I did in fact check it, and there is no weirdness on the analysis page. See
the result below.

Index Analyzer

org.apache.solr.analysis.KeywordTokenizerFactory {}
    term position: 1, term text: New York, term type: word, source start,end: 0,8, payload: (empty)
org.apache.solr.analysis.TrimFilterFactory {}
    term position: 1, term text: New York, term type: word, source start,end: 0,8, payload: (empty)
org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
    term position: 1, term text: New York, term type: word, source start,end: 0,8, payload: (empty)
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
    term position: 1, term text: New York, term type: word, source start,end: 0,8, payload: (empty)
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position: 1, term text: New York, term type: word, source start,end: 0,8, payload: (empty)

Query Analyzer

org.apache.solr.analysis.KeywordTokenizerFactory {}
    term position: 1, term text: New York, term type: word, source start,end: 0,8, payload: (empty)
org.apache.solr.analysis.TrimFilterFactory {}
    term position: 1, term text: New York, term type: word, source start,end: 0,8, payload: (empty)
org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
    term position: 1, term text: New York, term type: word, source start,end: 0,8, payload: (empty)
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
    term position: 1, term text: New York, term type: word, source start,end: 0,8, payload: (empty)
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
    term position: 1, term text: New York, term type: word, source start,end: 0,8, payload: (empty)
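
For reference, the analysis chain reported above can be modeled roughly in Python. This is only a toy sketch, not Solr's implementation: the function names, the empty default stopword set, and the empty default synonym map are assumptions, since the contents of entity-stopwords.txt and synonyms.txt are not shown in this thread. It illustrates why a keyword-tokenized value such as "New York" is expected to come out as a single token with offsets 0,8.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Token = Tuple[str, int, int]  # (term text, start offset, end offset)


def keyword_tokenize(value: str) -> List[Token]:
    """Emit the whole field value as one token, as a keyword tokenizer does."""
    return [(value, 0, len(value))]


def trim(tokens: List[Token]) -> List[Token]:
    """Strip surrounding whitespace from each term; offsets are left unchanged."""
    return [(text.strip(), start, end) for text, start, end in tokens]


def stop(tokens: List[Token], stopwords: FrozenSet[str]) -> List[Token]:
    """Drop tokens whose whole text is a stopword (case-insensitive)."""
    return [t for t in tokens if t[0].lower() not in stopwords]


def synonyms(tokens: List[Token], mapping: Dict[str, str]) -> List[Token]:
    """Replace a term with its mapped synonym, if any (case-insensitive lookup)."""
    return [(mapping.get(text.lower(), text), start, end) for text, start, end in tokens]


def remove_duplicates(tokens: List[Token]) -> List[Token]:
    """Simplification: keep only the first occurrence of each term text."""
    seen = set()
    unique = []
    for token in tokens:
        if token[0] not in seen:
            seen.add(token[0])
            unique.append(token)
    return unique


def analyze(
    value: str,
    stopwords: FrozenSet[str] = frozenset(),      # hypothetical: real list not shown
    mapping: Optional[Dict[str, str]] = None,     # hypothetical: real synonyms not shown
) -> List[Token]:
    """Run the five stages in the same order as the keywordText field type."""
    tokens = keyword_tokenize(value)
    tokens = trim(tokens)
    tokens = stop(tokens, stopwords)
    tokens = synonyms(tokens, mapping or {})
    return remove_duplicates(tokens)


print(analyze("New York"))  # [('New York', 0, 8)]
```

In this model the stop filter only ever sees the whole value, so a hypothetical stopword entry "new" would remove a bare "New" value but leave "New York" intact.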


On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano <czambran@gmail.com> wrote:

> Have you tried using the Analysis page to see what tokens are generated for
> the string "New York"? It could be that one of the token filters is adding
> the token 'new' for all strings that start with 'new'.
>
>
> On 10/06/2009 02:54 PM, Ravi Kiran wrote:
>
>> Hello All,
>> I am getting some ghost facets in Solr 1.4. Can anybody kindly help me
>> understand why I get them and how to eliminate them? My schema.xml snippet
>> is given at the end. I am indexing named entities extracted via OpenNLP
>> into Solr. My understanding of KeywordTokenizerFactory is that it treats
>> the entire field value as a single token, am I right? For example, "New York"
>> will be indexed as 'New York' and will not be split, right? However, I see
>> them split up in the facets as follows when running the query
>>
>> http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1
>>
>> but when I search with the standard handler, qt=standard&q=keyword:"New", I
>> don't find any doc which has just "New". After digging in a bit, I found that
>> if several keywords have a common starting word, it is pulled out as
>> another facet, as in the following. Any help is greatly appreciated.
>>
>> Result
>> ------------
>> <int name="New">47</int>     -------->  Ghost
>> <int name="New Hampshire">7</int>
>> <int name="New Jersey">16</int>
>> <int name="New Orleans">10</int>
>> <int name="New York">147</int>
>> <int name="New York City">23</int>
>> <int name="New York Giants">8</int>
>> <int name="New York Islanders">5</int>
>> <int name="New York Mercantile Exchange">6</int>
>> <int name="New York Mets">8</int>
>> <int name="New York Stock Exchange">10</int>
>> <int name="New York Times">8</int>
>> <int name="New York University">5</int>
>> <int name="New Zealand">7</int>
>>
>> <int name="Energy">7</int>     -------------->  Ghost
>> <int name="Energy Department">5</int>
>> <int name="Energy Information Administration">5</int>
>>
>>
>> <int name="Federal">7</int>   -------------->  Ghost
>> <int name="Federal Deposit Insurance Corp.">6</int>
>> <int name="Federal Reserve">26</int>
>> <int name="Federal Reserve Chairman">6</int>
>>
>> <int name="North">27</int>
>> <int name="North Carolina">8</int>
>> <int name="North Dakota">7</int>
>> <int name="North Korea">12</int>
>>
>> Schema.xml
>> -----------------
>>
>>     <fieldType name="keywordText" class="solr.TextField"
>>                sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
>>       <analyzer type="index">
>>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>>         <filter class="solr.TrimFilterFactory"/>
>>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>>                 words="stopwords.txt,entity-stopwords.txt"
>>                 enablePositionIncrements="true"/>
>>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>                 ignoreCase="true" expand="false"/>
>>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>       </analyzer>
>>       <analyzer type="query">
>>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>>         <filter class="solr.TrimFilterFactory"/>
>>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>>                 words="stopwords.txt,entity-stopwords.txt"
>>                 enablePositionIncrements="true"/>
>>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>                 ignoreCase="true" expand="false"/>
>>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>       </analyzer>
>>     </fieldType>
>>
>>     <field name="person" type="keywordText" indexed="true" stored="true"
>>            multiValued="true" termVectors="false" termPositions="false"
>>            termOffsets="false"/>
>>     <field name="organization" type="keywordText" indexed="true" stored="true"
>>            multiValued="true" termVectors="false" termPositions="false"
>>            termOffsets="false"/>
>>     <field name="location" type="keywordText" indexed="true" stored="true"
>>            multiValued="true" termVectors="false" termPositions="false"
>>            termOffsets="false"/>
>>     <field name="keyword" type="keywordText" indexed="true" stored="true"
>>            multiValued="true" termVectors="false" termPositions="false"
>>            termOffsets="false"/>
>>
>>
>>
>
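
To narrow down which facet values are worth checking by hand, the facet counts can be scanned for single-word values that are also the first word of longer values. What follows is a minimal sketch, assuming the counts arrive as an XML list like the result quoted above; the wrapping <lst name="keyword"> element, the trimmed sample data, and the function name ghost_candidates are assumptions for illustration. It only flags candidates and cannot tell a ghost from a value that really was indexed on its own.

```python
import xml.etree.ElementTree as ET
from typing import List

# Trimmed sample of the facet counts quoted above; the <lst> wrapper is assumed.
FACET_XML = """<lst name="keyword">
<int name="New">47</int>
<int name="New Hampshire">7</int>
<int name="New York">147</int>
<int name="New York City">23</int>
<int name="Energy">7</int>
<int name="Energy Department">5</int>
<int name="Federal Reserve">26</int>
</lst>"""


def ghost_candidates(facet_xml: str) -> List[str]:
    """Return single-word facet values that are the first word of a longer value."""
    root = ET.fromstring(facet_xml)
    names = [element.get("name") for element in root.iter("int")]
    candidates = []
    for name in names:
        if " " in name:
            continue  # multi-word values are the expected, unsplit entities
        if any(other.startswith(name + " ") for other in names):
            candidates.append(name)
    return sorted(candidates)


print(ghost_candidates(FACET_XML))  # ['Energy', 'New']
```

Each flagged value can then be queried directly, for example with q=keyword:"New" as in the quoted message, to see whether any document actually carries it.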
