lucene-solr-user mailing list archives

From Shawn Heisey <>
Subject Re: Cores and ranking (search quality)
Date Tue, 10 Mar 2015 17:58:40 GMT
On 3/10/2015 11:17 AM, wrote:
> Say I have two cores: one has 10 docs, the other has 100,000 docs.  I then submit two
> docs that are 100% identical (with the exception of the unique-ID field, which is stored
> but not indexed), one to each core.  The question is, during search, will both of those docs
> rank near each other or not?  If so, this is great because it will behave the same as if I
> had one core and indexed both docs to this single core.  If not, which core's doc will rank
> higher, and how far apart will the two docs be from each other in the ranking?
> Put another way: do docs from the smaller core (the one with only 10 docs) rank higher
> or lower than docs from the larger core (the one with 100,000 docs)?

Without specific knowledge about the document in question as well as all
the other documents, this is impossible to answer, except to say that
the relative ranking position is likely to be different.  Dropping back
to general info:

The overall term frequency and inverse document frequency (TF-IDF)
statistics in the 100,000 document index will very likely be quite
different from those in the 10 document index, so the same document can
score differently in each core.  That will affect ranking order.
Sometimes users are surprised by the results they get, but it is very
rare to find a bug in Lucene scoring.
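
To make that concrete, here is a rough sketch of just the IDF part of
classic Lucene TF-IDF scoring, idf = 1 + ln(numDocs / (docFreq + 1)),
which is the form used by the default similarity in Lucene 4.x as far as
I know.  The document frequencies (3 and 500) are invented for
illustration; the real numbers depend entirely on what is in each index.
The term frequency inside the two identical docs is the same, so it is
mostly the IDF that pulls their scores apart.

```python
import math

def classic_idf(doc_freq, num_docs):
    """IDF in the style of Lucene's classic TF-IDF similarity (assumed 4.x form)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical document frequencies for one query term in each core.
idf_small = classic_idf(doc_freq=3, num_docs=10)         # 10 document core
idf_large = classic_idf(doc_freq=500, num_docs=100_000)  # 100,000 document core

print(f"small core idf: {idf_small:.4f}")
print(f"large core idf: {idf_large:.4f}")
```

With these made-up numbers the term is much rarer, relatively speaking,
in the large core, so its IDF there is higher.  Different numbers could
just as easily push it the other way.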

In addition to the debug parameter that Erick told you about, here are a
couple of classes you could investigate at the source code level for
more information about ranking:

Here's info that is more general, and from a much earlier Lucene version:

I have my Solr install configured to use the BM25 similarity.

SOLR-1632 aims to make TF-IDF the same across multiple cores as you
would get if you only had one core.  I do not know enough about it to
know whether it is EXACTLY the same, or only an approximation ... but in
a search context, 100 percent precise calculation is rarely required. 
When you drop that as a requirement, search becomes easier and a LOT faster.
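
The general idea of making the statistics global can be sketched like
this: add up numDocs and docFreq over all cores and compute one IDF from
the totals, so identical docs get the same value no matter which core
holds them.  This only illustrates the idea and is not how SOLR-1632 is
implemented; the per-core numbers and the function names are made up.

```python
import math

def classic_idf(doc_freq, num_docs):
    """IDF in the style of Lucene's classic TF-IDF similarity (assumed 4.x form)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical per-core statistics for a single query term.
cores = [
    {"num_docs": 10, "doc_freq": 3},
    {"num_docs": 100_000, "doc_freq": 500},
]

def global_idf(cores):
    """One IDF computed from statistics summed over every core."""
    total_docs = sum(c["num_docs"] for c in cores)
    total_df = sum(c["doc_freq"] for c in cores)
    return classic_idf(total_df, total_docs)

local = [classic_idf(c["doc_freq"], c["num_docs"]) for c in cores]
shared = global_idf(cores)

print("per-core idf:", [round(v, 4) for v in local])
print("global idf:  ", round(shared, 4))
```

The global value is the same one you would get from a single core
holding all 100,010 docs, which is the behavior the question asks about.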

