lucene-solr-user mailing list archives

From Walter Underwood <wun...@wunderwood.org>
Subject Re: Cores and and ranking (search quality)
Date Tue, 10 Mar 2015 18:38:32 GMT
On Mar 10, 2015, at 10:17 AM, johnmunir@aol.com wrote:

> If I have two cores, one core has 10 docs and another has 100,000 docs.  I then submit two
> docs that are 100% identical (with the exception of the unique-ID field, which is stored
> but not indexed), one to each core.  The question is, during search, will both of those docs
> rank near each other or not? […]
>
> Put another way: do docs from the smaller core (the one that has only 10 docs) rank higher
> or lower compared to docs from the larger core (the one with 100,000 docs)?

These are not quite the same question.

tf.idf ranking depends on the other documents in the collection (the idf term). With 10 docs,
the document frequency statistics are effectively random noise, so the ranking is unpredictable.

Identical documents should rank identically, but whether they are higher or lower in the two
cores depends on the rest of the docs.
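
To make the dependence on the rest of the collection concrete, here is a minimal sketch. It assumes the classic Lucene-style formula idf = 1 + ln(N / (df + 1)); the function name and the document-frequency counts are made up for illustration, and a real score also involves tf and length normalization.

```python
import math

def classic_idf(num_docs: int, doc_freq: int) -> float:
    """Classic Lucene-style idf (assumed here): 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical counts: the same query term appears in 3 docs of each core.
small_core = classic_idf(num_docs=10, doc_freq=3)
large_core = classic_idf(num_docs=100_000, doc_freq=3)

print(f"10-doc core idf:      {small_core:.3f}")
print(f"100,000-doc core idf: {large_core:.3f}")
```

The identical document gets a different term weight in each core, so where it lands relative to its neighbors depends on what else is indexed there.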

idf statistics don’t settle down until at least 10K docs. You still sometimes see anomalies
under a million documents. 
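
A rough way to see the noise is a toy simulation. The sketch below assumes each document independently contains the term with a fixed probability (a simplification, not how real text behaves) and reuses the classic idf formula as an assumption; the helper names are hypothetical. It only illustrates the trend and does not establish the 10K threshold.

```python
import math
import random
import statistics

def classic_idf(num_docs: int, doc_freq: int) -> float:
    """Classic Lucene-style idf (assumed here): 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def idf_spread(num_docs: int, term_prob: float, trials: int = 100, seed: int = 0) -> float:
    """Standard deviation of idf across randomly drawn collections of one size."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    idfs = []
    for _ in range(trials):
        # Count how many docs in this simulated collection contain the term.
        doc_freq = sum(rng.random() < term_prob for _ in range(num_docs))
        idfs.append(classic_idf(num_docs, doc_freq))
    return statistics.pstdev(idfs)

for n in (10, 1_000, 10_000):
    print(n, round(idf_spread(n, term_prob=0.05), 4))
```

In this toy model the spread of idf values shrinks as the collection grows, which is the "settling down" effect described above.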

What design decision do you need to make? We can probably answer that for you.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

