lucene-solr-user mailing list archives

From Christian Zambrano <czamb...@gmail.com>
Subject Re: Weird Facet and KeywordTokenizerFactory Issue
Date Tue, 06 Oct 2009 20:19:48 GMT
Have you tried using the Analysis page to see what tokens are generated 
for the string "New York"? It could be that one of the token filters is 
adding the token 'new' for all strings that start with 'new'.
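
If you would rather script the check than click through the admin page, here is a minimal sketch in Python that builds a field-analysis request for the string "New York". It assumes a FieldAnalysisRequestHandler is registered at /analysis/field in your solrconfig.xml (that path is my assumption, not something from your config). The base URL is the one from your facet query, and field_analysis_url is just a hypothetical helper name.

```python
from urllib.parse import urlencode

# Core URL taken from the facet query in the quoted message.
SOLR_CORE_URL = "http://localhost:8080/solr-admin/topicscore"


def field_analysis_url(field_type, value, base=SOLR_CORE_URL):
    """Build a URL asking Solr which tokens `field_type` produces for `value`.

    Assumption: a FieldAnalysisRequestHandler is mapped to /analysis/field
    in solrconfig.xml. Adjust the path if your setup differs.
    """
    params = {
        "analysis.fieldtype": field_type,   # e.g. the keywordText type
        "analysis.fieldvalue": value,       # text run through the index analyzer
    }
    # urlencode escapes the space in "New York" as "+".
    return base + "/analysis/field?" + urlencode(params)


if __name__ == "__main__":
    print(field_analysis_url("keywordText", "New York"))
```

Opening the printed URL in a browser should list the tokens after each stage of the chain, so you can see whether the trim, stop, or synonym filter is the one emitting a bare 'New'.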

On 10/06/2009 02:54 PM, Ravi Kiran wrote:
> Hello All,
>                I am getting some ghost facets in Solr 1.4. Can anybody kindly
> help me understand why I get them and how to eliminate them? My schema.xml
> snippet is given at the end. I am indexing named entities extracted via
> OpenNLP into Solr. My understanding of KeywordTokenizerFactory is
> that it treats the entire input as a single token, am I right? For example, "New
> York" will be indexed as 'New York' and will not be split, right? However, I
> see them split up in the facets as follows when running the query "
> http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1"... but
> when I search with the standard handler, qt=standard&q=keyword:"New", I don't find
> any doc which has just "New". After digging in a bit I found that if several
> keywords have a common starting word, it is being pulled out as another facet
> like the following. Any help is greatly appreciated.
>
> Result
> ------------
> <int name="New">47</int>     -------->  Ghost
> <int name="New Hampshire">7</int>
> <int name="New Jersey">16</int>
> <int name="New Orleans">10</int>
> <int name="New York">147</int>
> <int name="New York City">23</int>
> <int name="New York Giants">8</int>
> <int name="New York Islanders">5</int>
> <int name="New York Mercantile Exchange">6</int>
> <int name="New York Mets">8</int>
> <int name="New York Stock Exchange">10</int>
> <int name="New York Times">8</int>
> <int name="New York University">5</int>
> <int name="New Zealand">7</int>
>
> <int name="Energy">7</int>     -------------->  Ghost
> <int name="Energy Department">5</int>
> <int name="Energy Information Administration">5</int>
>
>
> <int name="Federal">7</int>   -------------->  Ghost
> <int name="Federal Deposit Insurance Corp.">6</int>
> <int name="Federal Reserve">26</int>
> <int name="Federal Reserve Chairman">6</int>
>
> <int name="North">27</int>
> <int name="North Carolina">8</int>
> <int name="North Dakota">7</int>
> <int name="North Korea">12</int>
>
> Schema.xml
> -----------------
>
>      <fieldType name="keywordText" class="solr.TextField"
> sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
>        <analyzer type="index">
>          <tokenizer class="solr.KeywordTokenizerFactory"/>
>          <filter class="solr.TrimFilterFactory" />
>          <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
>
>          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="false" />
>          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>        </analyzer>
>        <analyzer type="query">
>          <tokenizer class="solr.KeywordTokenizerFactory"/>
>          <filter class="solr.TrimFilterFactory" />
>          <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"
> />
>          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="false" />
>          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>        </analyzer>
>      </fieldType>
>
>      <field name="person" type="keywordText" indexed="true" stored="true"
> multiValued="true" termVectors="false" termPositions="false"
> termOffsets="false"/>
>      <field name="organization" type="keywordText" indexed="true"
> stored="true" multiValued="true" termVectors="false" termPositions="false"
> termOffsets="false"/>
>      <field name="location" type="keywordText" indexed="true" stored="true"
> multiValued="true" termVectors="false" termPositions="false"
> termOffsets="false"/>
>      <field name="keyword" type="keywordText" indexed="true" stored="true"
> multiValued="true" termVectors="false" termPositions="false"
> termOffsets="false"/>
>
>    
