lucene-solr-user mailing list archives

From Ravi Kiran <ravi.bhas...@gmail.com>
Subject Re: Dilemma - Very Frequent Synonym updates for Huge Index
Date Thu, 01 Jul 2010 18:06:52 GMT
Hello Mr. Høydahl,
                          I had thought of doing it exactly as you describe,
and I shall try it out and see where I land. However, I am still skeptical
about that approach from a performance point of view: we are a
round-the-clock news organization, and a huge reindex might slow down
searches. Moreover, in the news business "being first" matters most, so we
need those synonyms to take effect right away, and that is where we are in a
quandary.

   With regards to the OpenNLP implementation, our design is plain vanilla
and sits outside of Solr. We generate the XML on the fly from the entities
extracted by OpenNLP and then index it straight into Solr. However, we do run
some sanity checks on locations against WordNet prior to indexing, so that
false positives are avoided in location names.
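
As a rough illustration only, an update message of this kind could look like
the sketch below. It uses Solr's standard XML update format; the field names
and values are hypothetical and are not taken from the actual schema.

```xml
<!-- Hypothetical Solr XML update message; all field names are illustrative -->
<add>
  <doc>
    <field name="id">article-12345</field>
    <field name="headline">Example headline</field>
    <!-- One field value per entity extracted by OpenNLP (assumes multiValued fields) -->
    <field name="person">Barack Obama</field>
    <field name="organization">Department of Homeland Security</field>
    <!-- Locations would be added only after the WordNet sanity check -->
    <field name="location">Virginia</field>
  </doc>
</add>
```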

Thanks,

Ravi Kiran Bhaskar

On Thu, Jul 1, 2010 at 5:40 AM, Jan Høydahl / Cominvent <
jan.asf@cominvent.com> wrote:

> Hi,
>
> I think I would look at a hybrid approach, where you keep adding new
> synonyms to a query-side synonym dictionary for immediate effect. Then,
> every now and then or every Nth night, you move those synonyms over to the
> index-side dictionary and trigger a full reindex.
>
> A nice side effect of reindexing now and then could be that if your OpenNLP
> extraction dictionaries have changed, it will be reflected too.
>
> BTW: Could you share details of your OpenNLP integration with us? I'm about
> to do it on another project..
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>
> On 1. juli 2010, at 06.57, Ravi Kiran wrote:
>
> > Hello,
> >        Hoping some Solr guru can help me out here. We are a news
> > organization trying to migrate 10 million documents from FAST to Solr. The
> > plan is to have our Editorial team add/modify synonyms multiple times
> > during a day as they deem appropriate. Hence we plan on using query-time
> > synonyms, as we cannot reindex every time they modify the synonyms file
> > (for the entities extracted by OpenNLP from the article body, such as
> > locations, organizations and person names). Since the synonyms are for
> > names, I am concerned that the multi-word phrase issue crops up with
> > query-time synonyms. For example, the synonyms could be as follows:
> >
> > The Washington Post Co., The Washington Post, Washington Post, The Post, TWP, WAPO
> > DHS,D.H.S,D.H.S.,Department of Homeland Security,Homeland Security
> > USCIS, United States Citizenship and Immigration Services, U.S.C.I.S.
> >
> > Barack Obama,Barack H. Obama,Barack Hussein Obama,President Obama
> > Hillary Clinton,Hillary R. Clinton,Hillary Rodham Clinton,Secretary Clinton,Sen. Clinton
> > William J. Clinton,William Jefferson Clinton,President Clinton,President Bill Clinton
> >
> > Virginia, Va., VA
> > D.C,Washington D.C, District of Columbia
> >
> > I have the following fieldType in schema.xml for the keywords/entities.
> > What issues should I be aware of? And is there a better way to achieve it
> > without having to reindex a million docs on each synonym change? NOTE that
> > I use tokenizerFactory="solr.KeywordTokenizerFactory" for the
> > SynonymFilterFactory to keep the words intact without splitting.
> >
> >    <!-- Field Type Keywords/Entities Extracted from OpenNLP -->
> >    <fieldType name="keywordText" class="solr.TextField"
> >               sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
> >      <analyzer type="index">
> >        <tokenizer class="solr.KeywordTokenizerFactory"/>
> >        <filter class="solr.TrimFilterFactory"/>
> >        <filter class="solr.StopFilterFactory" ignoreCase="true"
> >                words="stopwords.txt,entity-stopwords.txt"
> >                enablePositionIncrements="true"/>
> >        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >      </analyzer>
> >      <analyzer type="query">
> >        <tokenizer class="solr.KeywordTokenizerFactory"/>
> >        <filter class="solr.TrimFilterFactory"/>
> >        <filter class="solr.StopFilterFactory" ignoreCase="true"
> >                words="stopwords.txt,entity-stopwords.txt"
> >                enablePositionIncrements="true"/>
> >        <filter class="solr.SynonymFilterFactory"
> >                tokenizerFactory="solr.KeywordTokenizerFactory"
> >                synonyms="person-synonyms.txt,organization-synonyms.txt,location-synonyms.txt,subject-synonyms.txt"
> >                ignoreCase="true" expand="true"/>
> >        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >      </analyzer>
> >    </fieldType>
>
>
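
A minimal sketch of how the hybrid approach suggested in this thread might be
expressed in schema.xml, assuming two separate synonym files. The file names
index-synonyms.txt and query-synonyms.txt are hypothetical, and the stop
filter from the original fieldType is left out for brevity.

```xml
<!-- Sketch only: index-synonyms.txt and query-synonyms.txt are hypothetical file names -->
<fieldType name="keywordText" class="solr.TextField"
           sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- Stable synonyms: applied only when documents are (re)indexed -->
    <filter class="solr.SynonymFilterFactory"
            tokenizerFactory="solr.KeywordTokenizerFactory"
            synonyms="index-synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- Recently added synonyms: take effect at query time, no reindex needed -->
    <filter class="solr.SynonymFilterFactory"
            tokenizerFactory="solr.KeywordTokenizerFactory"
            synonyms="query-synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Under this arrangement, editors would add entries to the query-side file
during the day. At each scheduled full reindex, those entries would be moved
into the index-side file and removed from the query-side one. Solr reads
synonym files when the core loads, so an edited file typically needs a core
reload before it takes effect.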
