lucene-solr-user mailing list archives

From Ravi Kiran <ravi.bhas...@gmail.com>
Subject Re: Dilemma - Very Frequent Synonym updates for Huge Index
Date Thu, 01 Jul 2010 17:57:58 GMT
Hello Mr. Arslan,

Thank you for responding so promptly. This solution is for searching
topics; each topic page aggregates all content related to that topic
(articles, photos, videos, etc.). At any point in time the user will be
searching for one topic only, for example: Barack Obama / Oracle Corp. /
Iraq / Gulf Oil Spill. The user is never allowed to do a natural search by
entering multiple disparate keywords/entities such as "Barack Obama Gulf
oil Spill". Bottom line: it is entity search. If that did not make sense,
take a look at what the New York Times does at the URL given below; that is
exactly what I am trying to do.

http://topics.nytimes.com/topics/reference/timestopics/all/b/index.html

Thanks,

Ravi Kiran Bhaskar


On Thu, Jul 1, 2010 at 7:04 AM, Ahmet Arslan <iorixxx@yahoo.com> wrote:

>
>
> --- On Thu, 7/1/10, Ravi Kiran <ravi.bhaskar@gmail.com> wrote:
>
> > From: Ravi Kiran <ravi.bhaskar@gmail.com>
> > Subject: Dilemma - Very Frequent Synonym updates for Huge Index
> > To: solr-user@lucene.apache.org
> > Date: Thursday, July 1, 2010, 7:57 AM
> > Hello,
> > Hoping some Solr guru can help me out here. We are a news organization
> > trying to migrate 10 million documents from FAST to Solr. The plan is to
> > have our Editorial team add/modify synonyms multiple times a day as they
> > deem appropriate. Hence we plan on using query-time synonyms, as we
> > cannot reindex every time they modify the synonyms file (for the
> > entities extracted by OpenNLP, like locations/organizations/person
> > names, from the article body). Since the synonyms are for names, I am
> > concerned that the multi-phrase issue crops up with the query-time
> > synonyms. For example, the synonyms could be as follows:
> >
> > The Washington Post Co., The Washington Post, Washington Post, The Post, TWP, WAPO
> > DHS,D.H.S,D.H.S.,Department of Homeland Security,Homeland Security
> > USCIS, United States Citizenship and Immigration Services, U.S.C.I.S.
> >
> > Barack Obama,Barack H. Obama,Barack Hussein Obama,President Obama
> > Hillary Clinton,Hillary R. Clinton,Hillary Rodham Clinton,Secretary Clinton,Sen. Clinton
> > William J. Clinton,William Jefferson Clinton,President Clinton,President Bill Clinton
> >
> > Virginia, Va., VA
> > D.C,Washington D.C, District of Columbia
> >
> > I have the following fieldType in schema.xml for the keywords/entities.
> > What issues should I be aware of? And is there a better way to achieve
> > it without having to reindex a million docs on each synonym change?
> > NOTE that I use tokenizerFactory="solr.KeywordTokenizerFactory" for the
> > SynonymFilterFactory to keep the words intact without splitting.
> >
> >     <!-- Field Type Keywords/Entities Extracted from OpenNLP -->
> >     <fieldType name="keywordText" class="solr.TextField"
> >                sortMissingLast="true" omitNorms="true"
> >                positionIncrementGap="100">
> >       <analyzer type="index">
> >         <tokenizer class="solr.KeywordTokenizerFactory"/>
> >         <filter class="solr.TrimFilterFactory"/>
> >         <filter class="solr.StopFilterFactory" ignoreCase="true"
> >                 words="stopwords.txt,entity-stopwords.txt"
> >                 enablePositionIncrements="true"/>
> >         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >       </analyzer>
> >       <analyzer type="query">
> >         <tokenizer class="solr.KeywordTokenizerFactory"/>
> >         <filter class="solr.TrimFilterFactory"/>
> >         <filter class="solr.StopFilterFactory" ignoreCase="true"
> >                 words="stopwords.txt,entity-stopwords.txt"
> >                 enablePositionIncrements="true"/>
> >         <filter class="solr.SynonymFilterFactory"
> >                 tokenizerFactory="solr.KeywordTokenizerFactory"
> >                 synonyms="person-synonyms.txt,organization-synonyms.txt,location-synonyms.txt,subject-synonyms.txt"
> >                 ignoreCase="true" expand="true"/>
> >         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >       </analyzer>
> >     </fieldType>
> >
>
> Have you ever used this fieldType? Searching on this field will be
> troublesome. You need to search for exactly the same entries as in your
> synonyms files. Additionally, you need to use the raw or field query
> parser, because the query text is split at whitespace before it reaches
> KeywordTokenizer.
>
> For example:  q=keywordText:(Washington Post Bill Clinton)&debugQuery=on
>
>
>
>
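A minimal sketch of the field query parser suggestion quoted above, for
illustration only. It builds a request URL in which the whole entity string
is passed through Solr's local-params syntax ({!field f=keywordText}), so
that it should reach the query analyzer of keywordText as one unit instead
of being split at whitespace first. The base URL
(http://localhost:8983/solr/select) and the helper name entity_query_url
are assumptions, not something taken from this thread.

```python
from urllib.parse import urlencode


def entity_query_url(base_url, field, entity):
    """Build a Solr select URL that queries one entity via the field parser.

    base_url and this helper are hypothetical; only the {!field f=...}
    local-params syntax and debugQuery=on come from Solr itself.
    """
    params = {
        # The field query parser analyzes the whole string with the
        # field's query analyzer rather than splitting it on whitespace.
        "q": "{!field f=%s}%s" % (field, entity),
        # Same debugging switch used in the example quoted above.
        "debugQuery": "on",
    }
    # urlencode percent-escapes the braces, '!', '=' and spaces.
    return base_url + "?" + urlencode(params)


url = entity_query_url(
    "http://localhost:8983/solr/select",  # assumed host, port and handler
    "keywordText",
    "Washington Post",
)
print(url)
```

With debugQuery=on the parsed query in the response should show whether
"Washington Post" arrived as a single term and whether the synonyms were
expanded. As far as I understand, the raw parser ({!raw f=keywordText})
skips analysis entirely, so the query-time SynonymFilterFactory would
probably not be applied with it; the field parser looks like the closer fit
for this fieldType.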
