lucene-dev mailing list archives

From "Shalin Shekhar Mangar (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-6810) Faster searching limited but high rows across many shards all with many hits
Date Wed, 24 Dec 2014 16:50:13 GMT

    [ https://issues.apache.org/jira/browse/SOLR-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258369#comment-14258369 ]

Shalin Shekhar Mangar edited comment on SOLR-6810 at 12/24/14 4:49 PM:
-----------------------------------------------------------------------

Thanks Per. This is great. I'm still going through the patch in detail but I have a few questions
and comments.

{code}
     * Algorithm
     * - Shard-queries 1) Ask, by forwarding the outer query, each shard for relevance of the (up to) #rows most relevant matching documents
     * - Find among those relevances the #rows highest global relevances
     * Note for each shard (S) how many entries (docs_among_most_relevant(S)) it has among the #rows globally highest relevances
     * - Shard-queries 2) Ask, by forwarding the outer query, each shard S for id and relevance of the (up to) #docs_among_most_relevant(S) most relevant matching documents
     * - Find among those id/relevances the #rows id's with the highest global relevances (lets call this set of id's X)
     * - Shard-queries 3) Ask, by sending id's, each shard to return the documents from set X that it holds
     * - Return the fetched documents to the client
{code}
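
The steps quoted above can be sketched in memory as follows. This is only an illustration under
assumptions, not code from the patch: the class, the Hit type and both method names are
hypothetical, each shard is modelled as a list of hits already sorted by descending score, and
the final fetch of full documents (shard-queries 3) is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory illustration only; class and method names are hypothetical, not from the patch. */
class ThreePhaseSketch {

    /** A matching document on a shard: its id and relevance (score). */
    static final class Hit {
        final String id;
        final double score;
        Hit(String id, double score) { this.id = id; this.score = score; }
    }

    /**
     * After shard-queries 1: given only the scores of each shard's top hits, count how many
     * of the `rows` globally highest scores each shard holds (docs_among_most_relevant(S)).
     */
    static int[] docsAmongMostRelevant(List<List<Double>> shardScores, int rows) {
        List<double[]> all = new ArrayList<>(); // each entry is {score, shardIndex}
        for (int s = 0; s < shardScores.size(); s++) {
            for (double score : shardScores.get(s)) {
                all.add(new double[] {score, s});
            }
        }
        // Highest score first; ties go to the lower shard index so the result is deterministic.
        all.sort((a, b) -> {
            int c = Double.compare(b[0], a[0]);
            return c != 0 ? c : Double.compare(a[1], b[1]);
        });
        int[] counts = new int[shardScores.size()];
        for (int i = 0; i < Math.min(rows, all.size()); i++) {
            counts[(int) all.get(i)[1]]++;
        }
        return counts;
    }

    /** Runs phases 1 and 2 over shards whose hits are already sorted by descending score. */
    static List<String> topIds(List<List<Hit>> shards, int rows) {
        // Shard-queries 1: scores only, at most `rows` per shard.
        List<List<Double>> scores = new ArrayList<>();
        for (List<Hit> shard : shards) {
            List<Double> top = new ArrayList<>();
            for (Hit h : shard.subList(0, Math.min(rows, shard.size()))) {
                top.add(h.score);
            }
            scores.add(top);
        }
        int[] counts = docsAmongMostRelevant(scores, rows);

        // Shard-queries 2: id and score, but only for counts[s] hits per shard.
        List<Hit> merged = new ArrayList<>();
        for (int s = 0; s < shards.size(); s++) {
            merged.addAll(shards.get(s).subList(0, counts[s]));
        }
        merged.sort((a, b) -> Double.compare(b.score, a.score)); // List.sort is stable

        // Set X: the ids a real implementation would send out in shard-queries 3.
        List<String> x = new ArrayList<>();
        for (Hit h : merged.subList(0, Math.min(rows, merged.size()))) {
            x.add(h.id);
        }
        return x;
    }

    public static void main(String[] args) {
        List<List<Hit>> shards = List.of(
            List.of(new Hit("a", 0.9), new Hit("b", 0.5), new Hit("c", 0.1)),
            List.of(new Hit("d", 0.8), new Hit("e", 0.7), new Hit("f", 0.6)),
            List.of(new Hit("g", 0.2)));
        System.out.println(topIds(shards, 3)); // prints [a, d, e]
    }
}
```

In this toy run with rows=3, ids are read for only three documents in total (one from the first
shard, two from the second, none from the third), whereas asking every shard for the ids of its
own top 3 would read seven.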

Since dqa.forceSkipGetIds is always true for this new algorithm, computing the set X is not
necessary: we can fetch all return fields directly from the individual shards and return the
response to the user. Is that correct?

I don't think the DefaultProvider and DefaultDefaultProvider are necessary. We can just keep
a single static ShardParams.getDQA(SolrParams params) method and modify it if we ever need
to change the default. A user who wants a different default can set dqa in the "defaults"
section of the search handler.
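
A minimal sketch of what such a single static method could look like, under stated assumptions:
a plain Map stands in for SolrParams so the example is self-contained, the enum constants are
placeholders rather than the patch's actual algorithm names, and "dqa" as the request parameter
name is inferred from this discussion.

```java
import java.util.Map;

/** Hypothetical sketch: a plain Map stands in for SolrParams; enum constants are placeholders. */
class DqaDefaultSketch {

    /** Placeholder names for the available distributed query algorithms. */
    enum Dqa { CURRENT_ALGORITHM, NEW_ALGORITHM }

    /** The one place to edit if the built-in default ever has to change. */
    static final Dqa BUILT_IN_DEFAULT = Dqa.CURRENT_ALGORITHM;

    /** Resolves the algorithm for one request; "dqa" as the parameter name is an assumption. */
    static Dqa getDQA(Map<String, String> params) {
        String requested = params.get("dqa");
        return requested == null ? BUILT_IN_DEFAULT : Dqa.valueOf(requested);
    }

    public static void main(String[] args) {
        System.out.println(getDQA(Map.of()));
        System.out.println(getDQA(Map.of("dqa", "NEW_ALGORITHM")));
    }
}
```

Because a request handler's "defaults" are merged into the request parameters before the
handler runs, a value configured there would reach a method like this as an ordinary parameter,
so no separate provider classes would be needed for per-installation defaults.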

Why do we need the switchToTestDQADefaultProvider() and switchToOriginalDQADefaultProvider()
methods? You are already applying the DQA on each request, so why is the switch necessary?

The patch still contains the ShardParams.purpose field, which you added in SOLR-6812 but which
I removed. I still think it is unnecessary to send purpose to the shards. Is it necessary for
this patch or is it just an artifact from SOLR-6812?

Did you benchmark it against the current algorithm for other kinds of use cases as well (3-5
shards, small number of rows)? I think not asking for ids can speed up responses there too.

{quote}
"all with many hits" means that each of the shards have a significant number of hits on the
query
{quote}

Unless I missed something, the algorithm has no effect with respect to how many docs are hit
by the query on each shard?


> Faster searching limited but high rows across many shards all with many hits
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-6810
>                 URL: https://issues.apache.org/jira/browse/SOLR-6810
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>            Reporter: Per Steffensen
>            Assignee: Shalin Shekhar Mangar
>              Labels: distributed_search, performance
>         Attachments: branch_5x_rev1642874.patch, branch_5x_rev1642874.patch, branch_5x_rev1645549.patch
>
>
> Searching "limited but high rows across many shards all with many hits" is slow
> E.g.
> * Query from outside client: q=something&rows=1000
> * Resulting in sub-requests to each shard something a-la this
> ** 1) q=something&rows=1000&fl=id,score
> ** 2) Request the full documents with ids in the global-top-1000 found among the top-1000 from each shard
> What does the subject mean
> * "limited but high rows" means 1000 in the example above
> * "many shards" means 200-1000 in our case
> * "all with many hits" means that each of the shards have a significant number of hits on the query
> The problem grows on all three factors above
> Doing such a query on our system takes between 5 min to 1 hour - depending on a lot of things. It ought to be much faster, so lets make it.
> Profiling show that the problem is that it takes lots of time to access the store to get id’s for (up to) 1000 docs (value of rows parameter) per shard. Having 1000 shards its up to 1 mio ids that has to be fetched. There is really no good reason to ever read information from store for more than the overall top-1000 documents, that has to be returned to the client.
> For further detail see mail-thread "Slow searching limited but high rows across many shards all with high hits" started 13/11-2014 on dev@lucene.apache.org



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

