lucene-solr-user mailing list archives

From Floyd Wu <floyd...@gmail.com>
Subject Re: BM25 model for solr 4?
Date Fri, 16 Nov 2012 05:28:20 GMT
Thanks everyone, especially Tom, for the detailed explanation of
this topic.
Of course, in academic work we need to interpret results carefully. What I care
about is the end user's point of view: will using BM25 give better
ranking than Lucene's original VSM+Boolean model? How
significant will the difference be?
I'd like to hear about the community's experience.

Floyd


2012/11/16 Tom Burton-West <tburtonw@umich.edu>

> Hello Floyd,
>
> There is a ton of research literature out there comparing BM25 to vector
> space.  But you have to be careful interpreting it.
>
> BM25 originally beat the SMART vector space model in the early TRECs
> because it did better tf and length normalization.  Pivoted Document
> Length normalization was invented to get the vector space model to catch up
> to BM25.  (Just Google for Singhal length normalization.  Amit Singhal,
> now chief of Google Search, did his doctoral thesis on this and it is
> available.  Similarly, Stephen Robertson, now at Microsoft Research,
> published a ton of studies of BM25.)
>
> The default Solr/Lucene similarity class doesn't provide the length or tf
> normalization tuning params that BM25 does.  There is the sweetspot
> similarity, but that doesn't quite work the same way that the BM25
> normalizations do.
>
> Document length normalization needs and parameter tuning all depend on
> your data.  So if you are reading a comparison, you need to determine:
> 1) When comparing recall/precision etc. between vector space and BM25, did
> the experimenter tune both the vector space and the BM25 parameters?
> 2) Are the documents (and queries) they are using in the test similar in
> length characteristics to your documents and queries?
>
> We are planning to do some testing in the next few months for our use case,
> which is 10 million books where we index the entire book.  These are
> extremely long documents compared to most IR research.
> I'd love to hear about actual (non-research) production implementations
> that have tested the new ranking models available in Solr.
>
> Tom
>
>
>
> On Wed, Nov 14, 2012 at 9:16 PM, Floyd Wu <floyd.wu@gmail.com> wrote:
>
> > Hi there,
> > Can anybody kindly tell me how to set up Solr to use BM25?
> > By the way, is there any experiment or research that compares BM25 and the
> > classical VSM model in terms of recall/precision?
> >
> > Thanks in advance.
> >
>
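
The tf and length normalization parameters Tom mentions can be made concrete
with a minimal sketch of the textbook BM25 term score. This is an illustration
only, not Lucene's or Solr's actual implementation; the function names, the
sample collection statistics, and the defaults k1=1.2 and b=0.75 are
assumptions chosen for the example.

```python
import math


def bm25_idf(n_docs, doc_freq):
    """Inverse document frequency in the common smoothed, non-negative form."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document.

    k1 controls tf saturation: the tf part can never exceed (k1 + 1).
    b controls length normalization: 0 ignores document length,
    1 normalizes fully by doc_len / avg_doc_len.
    """
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return bm25_idf(n_docs, doc_freq) * tf * (k1 + 1) / (tf + length_norm)


if __name__ == "__main__":
    # Hypothetical collection: 1000 docs, the term occurs in 10 of them,
    # average document length 1000 tokens.
    for b in (0.0, 0.75, 1.0):
        short_doc = bm25_term_score(3, 100, 1000.0, 1000, 10, b=b)
        long_doc = bm25_term_score(3, 10000, 1000.0, 1000, 10, b=b)
        print(f"b={b}: short doc {short_doc:.3f}, long doc {long_doc:.3f}")
```

With b=0 the document length has no effect on the score; raising b penalizes
long documents more strongly. That is one reason a collection of very long
documents, such as whole books, may need different settings than the
collections used in a published comparison.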
