lucene-dev mailing list archives

From "Shalin Shekhar Mangar (JIRA)" <>
Subject [jira] [Commented] (SOLR-6810) Faster searching limited but high rows across many shards all with many hits
Date Wed, 24 Dec 2014 21:49:14 GMT


Shalin Shekhar Mangar commented on SOLR-6810:

bq. The main idea seems to be: you don't need IDs to merge the top docs from each shard. Correct?

Yes, exactly.

bq. I'm still not quite grokking it though... do you understand it well enough to give a
high-level description for those who know Solr but who haven't looked at the patch?

The idea is to:
# Get the scores for the top N docs from each shard in the first pass (say rows=3, and shard1
returns scores 0.8, 0.5, 0.3 while shard2 returns 0.9, 0.6, 0.1).
# Merge them to find the top N scores (0.9, 0.8, 0.6) and track how many results each shard
contributes to the top N (shard1 has 1 doc in the top 3 and shard2 has 2 docs in the top 3).
# Get the corresponding docs (id and all return fields) from each shard in the second pass
(retrieve the top 1 doc from shard1 and the top 2 docs from shard2).
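
The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the patch's code: the function names `merge_top_scores` and `second_pass` and the in-memory `shard_docs` structure are hypothetical stand-ins for the real shard requests, and the sketch assumes results are sorted by score only.

```python
import heapq
from collections import Counter


def merge_top_scores(shard_scores, rows):
    """Steps 1 and 2: merge per-shard scores and count hits per shard.

    shard_scores maps a shard name to its scores, sorted descending.
    Returns the global top `rows` scores and how many of them each
    shard contributed. No document ids are needed for this.
    """
    # Tag every score with the shard it came from.
    tagged = (
        (score, shard)
        for shard, scores in shard_scores.items()
        for score in scores[:rows]
    )
    top = heapq.nlargest(rows, tagged)
    counts = Counter(shard for _, shard in top)
    return [score for score, _ in top], dict(counts)


def second_pass(shard_docs, counts):
    """Step 3: ask each shard only for as many docs as it contributes.

    shard_docs is a hypothetical stand-in for a shard's index: a list
    of (score, doc) pairs sorted by score, descending.
    """
    fetched = []
    for shard, k in counts.items():
        fetched.extend(shard_docs[shard][:k])
    fetched.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in fetched]


# The example from the steps above: rows=3, two shards.
scores = {"shard1": [0.8, 0.5, 0.3], "shard2": [0.9, 0.6, 0.1]}
top_scores, per_shard = merge_top_scores(scores, rows=3)

docs = {
    "shard1": [(0.8, "a"), (0.5, "b"), (0.3, "c")],
    "shard2": [(0.9, "x"), (0.6, "y"), (0.1, "z")],
}
result = second_pass(docs, per_shard)
```

With the example numbers, shard1 is asked for 1 doc and shard2 for 2 docs, so only 3 documents are read from the stores in total.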

bq. As in... what's the high level description of what this patch implements?

The patch implements this algorithm, of course. It makes the algorithm configurable using a
new 'dqa' parameter. There are some refactorings in ShardParams and ResponseBuilder to make this
work. There are good randomized tests such that all Solr tests switch between the new and
old algorithms. The patch also adds wrapper classes for SolrCore, SolrIndexSearcher and LeafReader,
which are used only during tests to assert things like the number of shard requests and the
number of stored field accesses.

bq. Also, does this patch also improve things if docValues are used for the ID field?


> Faster searching limited but high rows across many shards all with many hits
> ----------------------------------------------------------------------------
>                 Key: SOLR-6810
>                 URL:
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>            Reporter: Per Steffensen
>            Assignee: Shalin Shekhar Mangar
>              Labels: distributed_search, performance
>         Attachments: branch_5x_rev1642874.patch, branch_5x_rev1642874.patch, branch_5x_rev1645549.patch
> Searching "limited but high rows across many shards all with many hits" is slow.
> E.g.
> * Query from outside client: q=something&rows=1000
> * Resulting in sub-requests to each shard, something like this:
> ** 1) q=something&rows=1000&fl=id,score
> ** 2) Request the full documents with ids in the global top-1000 found among the top-1000 from each shard
> What does the subject mean?
> * "limited but high rows" means 1000 in the example above
> * "many shards" means 200-1000 in our case
> * "all with many hits" means that each of the shards has a significant number of hits on the query
> The problem grows with each of the three factors above.
> Doing such a query on our system takes between 5 minutes and 1 hour, depending on a lot of things. It ought to be much faster, so let's make it so.
> Profiling shows that the problem is the time it takes to access the store to get ids for (up to) 1000 docs (the value of the rows parameter) per shard. With 1000 shards, that is up to 1 million ids that have to be fetched. There is really no good reason to ever read information from the store for more than the overall top-1000 documents that have to be returned to the client.
> For further detail see the mail thread "Slow searching limited but high rows across many shards all with high hits" started 13/11-2014 on
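
The arithmetic in the issue description can be written out as a small sketch. The function names are hypothetical, and both values are upper bounds on id reads from the store, assuming every shard has at least `rows` hits.

```python
def id_reads_ids_in_first_pass(num_shards, rows):
    # Existing approach: every shard reads ids for up to `rows` docs.
    return num_shards * rows


def id_reads_scores_in_first_pass(rows):
    # Proposed approach: only the overall top `rows` docs are read.
    return rows


existing = id_reads_ids_in_first_pass(num_shards=1000, rows=1000)
proposed = id_reads_scores_in_first_pass(rows=1000)
```

For 1000 shards and rows=1000 this gives up to 1,000,000 id reads for the existing approach against up to 1,000 for the proposed one.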

This message was sent by Atlassian JIRA
