lucene-dev mailing list archives

From "Per Steffensen (JIRA)" <>
Subject [jira] [Commented] (SOLR-6810) Faster searching limited but high rows across many shards all with many hits
Date Tue, 30 Dec 2014 07:28:13 GMT


Per Steffensen commented on SOLR-6810:

I think the strategy that Shalin & I were talking about as a potential default was one
that never collected IDs separately, hence no extra round-trip.
step 1: retrieve sort field values (then merge and calculate the range of ordinals needed
for each shard)
step 2: retrieve stored fields by specifying the ordinals from each shard
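
The merge in step 1 can be sketched as below. This is only an illustration, not Solr code: the function name `ordinals_needed` and the plain lists of sort values are hypothetical, and each shard is assumed to return its sort values already ordered best-first. Because a shard's contribution to the global top is always a prefix of its own list, the "range of ordinals" for a shard reduces to a count n, meaning ordinals 0..n-1.

```python
import heapq
from itertools import islice


def ordinals_needed(shard_sort_values, rows):
    """Return, per shard, how many of its leading hits are in the global top `rows`.

    shard_sort_values: one list per shard, each sorted best-first (descending).
    Only sort values are used here; no ids or stored fields are involved.
    """
    # Negate the values so that heapq.merge (ascending) yields best-first,
    # and tag each value with its shard and ordinal.
    streams = [
        [(-value, shard, ordinal) for ordinal, value in enumerate(values)]
        for shard, values in enumerate(shard_sort_values)
    ]
    needed = [0] * len(shard_sort_values)
    for _, shard, _ in islice(heapq.merge(*streams), rows):
        needed[shard] += 1
    return needed


# Three shards, global top 4: two hits come from shard 0, two from shard 1.
print(ordinals_needed([[9.0, 7.0, 1.0], [8.0, 6.5, 6.0], [2.0]], rows=4))  # [2, 2, 0]
```

Step 2 would then ask shard 0 for ordinals 0-1, shard 1 for ordinals 0-1, and skip shard 2 entirely.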

That sounds like a third DQA; it does not seem to be exactly what my new algorithm does. But your
suggestion still has 2 round-trips. The old/current default-DQA with dqa.forceSkipGetIds/distrib.singlePass=false
(default) has 2 round-trips. My new DQA with dqa.forceSkipGetIds=true (default) has 2 round-trips,
so choosing that as the new default-DQA will not introduce an extra round-trip compared to
today's default-DQA. But we could choose the old/current default-DQA with dqa.forceSkipGetIds/distrib.singlePass=true
as the new default-DQA. It has only 1 round-trip, so compared to that, both your DQA (above)
and my new DQA have an extra round-trip. My new DQA will never become 1-round-trip only, because
the essence is to make a first round-trip (inexpensive because you do not access the store, not
even for ids) that collects the information needed to limit "rows" for the second round-trip, where
you actually retrieve the stored fields.
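
As a rough illustration of that idea, here is a toy in-memory simulation. It is a sketch under assumptions, not the patch itself: the `Shard` class, its `store_reads` counter, and `two_round_trip_search` are hypothetical names, and scores stand in for whatever sort values a real query would use.

```python
import heapq
from itertools import islice


class Shard:
    """Toy in-memory shard; `store_reads` counts simulated stored-field accesses."""

    def __init__(self, docs):
        # docs: list of (score, stored_document), kept sorted best-first.
        self.docs = sorted(docs, key=lambda d: d[0], reverse=True)
        self.store_reads = 0

    def top_scores(self, rows):
        # Round-trip 1: scores only, so the store is never touched.
        return [score for score, _ in self.docs[:rows]]

    def fetch_top(self, rows):
        # Round-trip 2: read stored fields for this shard's first `rows` hits only.
        self.store_reads += min(rows, len(self.docs))
        return self.docs[:rows]


def two_round_trip_search(shards, rows):
    # Round-trip 1: collect scores and work out each shard's share of the global top.
    streams = [
        [(-score, i) for score in shard.top_scores(rows)]
        for i, shard in enumerate(shards)
    ]
    share = [0] * len(shards)
    for _, i in islice(heapq.merge(*streams), rows):
        share[i] += 1
    # Round-trip 2: fetch stored fields with a per-shard "rows" limit, then merge.
    hits = []
    for shard, n in zip(shards, share):
        if n:
            hits.extend(shard.fetch_top(n))
    hits.sort(key=lambda d: d[0], reverse=True)
    return hits


# Four shards with 50 hits each; the client asks for the top 10.
shards = [Shard([(i + 0.1 * s, f"s{s}-d{i}") for i in range(50)]) for s in range(4)]
hits = two_round_trip_search(shards, rows=10)
print(len(hits), sum(s.store_reads for s in shards))  # 10 10
```

In this toy setup the total number of store reads across all shards equals the number of rows returned, regardless of how many shards there are.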

bq. one that never collected IDs separately

My new DQA does not collect IDs separately (as long as you do not explicitly set dqa.forceSkipGetIds=false).

> Faster searching limited but high rows across many shards all with many hits
> ----------------------------------------------------------------------------
>                 Key: SOLR-6810
>                 URL:
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>            Reporter: Per Steffensen
>            Assignee: Shalin Shekhar Mangar
>              Labels: distributed_search, performance
>         Attachments: branch_5x_rev1642874.patch, branch_5x_rev1642874.patch, branch_5x_rev1645549.patch
> Searching "limited but high rows across many shards all with many hits" is slow
> E.g.
> * Query from outside client: q=something&rows=1000
> * Resulting in sub-requests to each shard, something like this:
> ** 1) q=something&rows=1000&fl=id,score
> ** 2) Request the full documents with ids in the global top-1000 found among the top-1000 from each shard
> What does the subject mean?
> * "limited but high rows" means 1000 in the example above
> * "many shards" means 200-1000 in our case
> * "all with many hits" means that each of the shards has a significant number of hits on the query
> The problem grows with all three factors above.
> Doing such a query on our system takes between 5 minutes and 1 hour, depending on a lot of things. It ought to be much faster, so let's make it so.
> Profiling shows that the problem is that it takes lots of time to access the store to get ids for (up to) 1000 docs (the value of the rows parameter) per shard. With 1000 shards, that is up to 1 million ids that have to be fetched. There is really no good reason to ever read information from the store for more than the overall top-1000 documents that have to be returned to the client.
> For further detail see mail-thread "Slow searching limited but high rows across many shards all with high hits" started 13/11-2014 on
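
The arithmetic behind the "up to 1 million ids" figure in the description can be checked with a few lines. The function name and the `hits_per_shard` value are illustrative assumptions; only the 1000 shards and rows=1000 come from the issue text.

```python
def worst_case_id_reads(num_shards, rows, hits_per_shard):
    # Each shard reads ids from its store for up to `rows` of its own hits.
    return num_shards * min(rows, hits_per_shard)


rows = 1000
per_query = worst_case_id_reads(num_shards=1000, rows=rows, hits_per_shard=50_000)
print(per_query)          # 1000000 id reads across the shards
print(per_query // rows)  # 1000 times more than the 1000 documents returned
```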
