lucene-solr-user mailing list archives

From MitchK <>
Subject Re: a bug of solr distributed search
Date Fri, 23 Jul 2010 18:23:58 GMT


Why don't we send the output of the TermsComponent of every node in the
cluster to a Hadoop instance?
Since the TermsComponent already does the map part of the map-reduce concept,
Hadoop only needs to reduce the results. Maybe we don't even need Hadoop for
this. After reducing, every node in the cluster gets the current global values
to compute the idf.
We can store this information in a HashMap-based SolrCache (or something
like that) to provide constant-time access. To keep the values up to date,
we can repeat the process every x minutes.
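A minimal sketch of the reduce step described above, assuming each shard ships its TermsComponent output as a simple term-to-docFreq map. The class and method names (GlobalDfReducer, reduce, idf) are illustrative, not Solr API; the idf formula shown is the classic Lucene-style one:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: merge each shard's TermsComponent output
// (term -> docFreq) into one global map, then compute idf from the
// global counts. Not actual Solr code.
public class GlobalDfReducer {

    // Sum per-shard document frequencies into a single global map.
    static Map<String, Long> reduce(Iterable<Map<String, Long>> shardDfs) {
        Map<String, Long> globalDf = new HashMap<>();
        for (Map<String, Long> shard : shardDfs) {
            for (Map.Entry<String, Long> e : shard.entrySet()) {
                globalDf.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return globalDf;
    }

    // Classic Lucene-style idf: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long numDocs, long docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        Map<String, Long> shardA = Map.of("solr", 3L, "lucene", 1L);
        Map<String, Long> shardB = Map.of("solr", 2L);

        Map<String, Long> global = reduce(List.of(shardA, shardB));
        System.out.println(global.get("solr"));   // 5
        System.out.println(global.get("lucene")); // 1
    }
}
```

The resulting global map is exactly what a HashMap-backed cache on every node would hold, so idf lookups at query time stay constant-time.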

Once we have that, it does not matter whether we use doc_X from shard_A or
shard_B, since they will all have the same scores.

Even with large indices of 10 million or more unique terms, this would only
cost a few megabytes of network traffic.

Kind regards,
- Mitch

Yonik Seeley wrote:
> As the comments suggest, it's not a bug, but just the best we can do
> for now since our priority queues don't support removal of arbitrary
> elements.  I guess we could rebuild the current priority queue if we
> detect a duplicate, but that will have an obvious performance impact.
> Any other suggestions?
> -Yonik
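The rebuild-on-duplicate workaround in the quoted message can be sketched roughly as follows. This uses java.util.PriorityQueue rather than Lucene's own PriorityQueue, and the names (ScoredDoc, addOrReplace) are hypothetical; the point is the O(n) rebuild when a duplicate doc id arrives from another shard:

```java
import java.util.PriorityQueue;

// Illustrative sketch, not Solr's actual code: without cheap arbitrary
// removal, detecting a duplicate doc id forces an O(n) rebuild of the
// queue -- the performance impact mentioned above.
public class DedupQueue {
    record ScoredDoc(String id, float score) {}

    static PriorityQueue<ScoredDoc> addOrReplace(PriorityQueue<ScoredDoc> queue,
                                                 ScoredDoc incoming) {
        ScoredDoc existing = queue.stream()
                .filter(d -> d.id().equals(incoming.id()))
                .findFirst().orElse(null);
        if (existing == null) {
            queue.add(incoming);          // no duplicate: normal insert
            return queue;
        }
        if (existing.score() >= incoming.score()) {
            return queue;                 // keep the better-scored copy
        }
        // Rebuild the whole queue without the stale duplicate,
        // then insert the better-scored incoming doc.
        PriorityQueue<ScoredDoc> rebuilt = new PriorityQueue<>(queue.comparator());
        for (ScoredDoc d : queue) {
            if (!d.id().equals(incoming.id())) {
                rebuilt.add(d);
            }
        }
        rebuilt.add(incoming);
        return rebuilt;
    }

    public static void main(String[] args) {
        PriorityQueue<ScoredDoc> queue =
                new PriorityQueue<>((a, b) -> Float.compare(a.score(), b.score()));
        queue.add(new ScoredDoc("doc1", 0.5f));
        queue.add(new ScoredDoc("doc2", 0.9f));

        // Same doc arrives from another shard with a different score.
        queue = addOrReplace(queue, new ScoredDoc("doc1", 0.7f));
        System.out.println(queue.size()); // 2
    }
}
```

With globally agreed idf values, both shards would report the same score for doc1 and this duplicate branch would never need the rebuild.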