lucene-solr-user mailing list archives

From Jan Høydahl / Cominvent <jan....@cominvent.com>
Subject Re: Dilemma - Very Frequent Synonym updates for Huge Index
Date Thu, 01 Jul 2010 09:40:15 GMT
Hi,

I think I would look at a hybrid approach, where you keep adding new synonyms to a query-side
synonym dictionary for immediate effect. Then, every now and then (say, every Nth night), you
move those synonyms over to the index-side dictionary and trigger a full reindex.
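
Roughly, the field type could look like the sketch below. This is only an illustration based on
your keywordText type: the file names index-synonyms.txt and query-synonyms.txt are made up, and
I have left out your stop filter for brevity.

```xml
<!-- Sketch only: index-synonyms.txt and query-synonyms.txt are hypothetical file names -->
<fieldType name="keywordText" class="solr.TextField"
           sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- Consolidated synonyms: edits here only affect documents (re)indexed afterwards -->
    <filter class="solr.SynonymFilterFactory"
            tokenizerFactory="solr.KeywordTokenizerFactory"
            synonyms="index-synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- Recent editorial additions: no reindex needed, but the core must reload the file -->
    <filter class="solr.SynonymFilterFactory"
            tokenizerFactory="solr.KeywordTokenizerFactory"
            synonyms="query-synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

After each scheduled reindex, the entries that were moved to the index-side file can be dropped
from the query-side file, so the query-side file stays small.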

A nice side effect of reindexing now and then is that if your OpenNLP extraction dictionaries
have changed, those changes will be reflected too.

BTW: Could you share details of your OpenNLP integration with us? I'm about to do the same on
another project.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 1 July 2010, at 06:57, Ravi Kiran wrote:

> Hello,
>        Hoping some Solr guru can help me out here. We are a news
> organization trying to migrate 10 million documents from FAST to Solr. The
> plan is to have our Editorial team add/modify synonyms multiple times
> a day as they deem appropriate. Hence we plan on using query-time synonyms,
> as we cannot reindex every time they modify the synonyms file (for the
> entities extracted by OpenNLP, like locations/organizations/person names from
> the article body). Since the synonyms are for names, I am concerned that the
> multi-phrase issue will crop up with the query-time synonyms. For example,
> the synonyms could be as follows:
> 
> The Washington Post Co., The Washington Post, Washington Post, The Post,
> TWP, WAPO
> DHS,D.H.S,D.H.S.,Department of Homeland Security,Homeland Security
> USCIS, United States Citizenship and Immigration Services, U.S.C.I.S.
> 
> Barack Obama,Barack H. Obama,Barack Hussein Obama,President Obama
> Hillary Clinton,Hillary R. Clinton,Hillary Rodham Clinton,Secretary
> Clinton,Sen. Clinton
> William J. Clinton,William Jefferson Clinton,President Clinton,President
> Bill Clinton
> 
> Virginia, Va., VA
> D.C,Washington D.C, District of Columbia
> 
> I have the following fieldType in schema.xml for the keywords/entities. What
> issues should I be aware of? And is there a better way to achieve this
> without having to reindex a million docs on each synonym change? NOTE that I
> use tokenizerFactory="solr.KeywordTokenizerFactory" for the
> SynonymFilterFactory to keep the words intact without splitting.
> 
>    <!-- Field Type Keywords/Entities Extracted from OpenNLP -->
>    <fieldType name="keywordText" class="solr.TextField"
>               sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.KeywordTokenizerFactory"/>
>        <filter class="solr.TrimFilterFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>                words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.KeywordTokenizerFactory"/>
>        <filter class="solr.TrimFilterFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>                words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
>        <filter class="solr.SynonymFilterFactory"
>                tokenizerFactory="solr.KeywordTokenizerFactory"
>                synonyms="person-synonyms.txt,organization-synonyms.txt,location-synonyms.txt,subject-synonyms.txt"
>                ignoreCase="true" expand="true"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>    </fieldType>

