lucene-dev mailing list archives

From "Per Steffensen (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-6810) Faster searching limited but high rows across many shards all with many hits
Date Sat, 27 Dec 2014 18:36:13 GMT

    [ https://issues.apache.org/jira/browse/SOLR-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259447#comment-14259447
] 

Per Steffensen edited comment on SOLR-6810 at 12/27/14 6:35 PM:
----------------------------------------------------------------

TestDistributedQueryAlgorithm.testDocReads shows exactly how the number of store accesses is reduced:
{code}
// Test the number of documents read from store using FIND_RELEVANCE_FIND_IDS_LIMITED_ROWS_FETCH_BY_IDS
// vs FIND_ID_RELEVANCE_FETCH_BY_IDS. This demonstrates the advantage of FIND_RELEVANCE_FIND_IDS_LIMITED_ROWS_FETCH_BY_IDS
// over FIND_ID_RELEVANCE_FETCH_BY_IDS (and vice versa)
private void testDocReads() throws Exception {
  for (int startValue = 0; startValue <= MAX_START; startValue++) {
    // FIND_RELEVANCE_FIND_IDS_LIMITED_ROWS_FETCH_BY_IDS (assuming skipGetIds used - default)
    // Only reads data (required fields) from store for "rows + (#shards * start)" documents across all shards
    // This can be optimized to become only "rows"
    // Only reads the data once
    testDQADocReads(ShardParams.DQA.FIND_RELEVANCE_FIND_IDS_LIMITED_ROWS_FETCH_BY_IDS, startValue, ROWS,
        ROWS + (startValue * jettys.size()), ROWS + (startValue * jettys.size()));

    // DQA.FIND_ID_RELEVANCE_FETCH_BY_IDS (assuming skipGetIds not used - default)
    // Reads data (ids only) from store for "rows + startValue" documents on each shard,
    // i.e. "(rows + startValue) * #shards" documents across all shards
    // Besides that, reads data (required fields) for "rows" documents across all shards
    testDQADocReads(ShardParams.DQA.FIND_ID_RELEVANCE_FETCH_BY_IDS, startValue, ROWS,
        (ROWS + startValue) * jettys.size(), ROWS + ((ROWS + startValue) * jettys.size()));
  }
}
{code}
{code}
private void testDQADocReads(ShardParams.DQA dqa, int start, int rows,
    int expectedUniqueIdCount, int expectedTotalCount) {
  ...
}
{code}
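The expected counts asserted above follow directly from the two formulas in the test's comments. As a rough standalone illustration (a hypothetical sketch, not part of the test class; "shards" plays the role of jettys.size() and the method names are illustrative only), the two read counts can be computed like this:

```java
// Hypothetical sketch of the document-read counts asserted in testDocReads.
public class DqaReadCounts {

    // FIND_RELEVANCE_FIND_IDS_LIMITED_ROWS_FETCH_BY_IDS (skipGetIds used):
    // reads required fields for rows + (start * shards) documents, once.
    static int limitedRowsReads(int rows, int start, int shards) {
        return rows + (start * shards);
    }

    // FIND_ID_RELEVANCE_FETCH_BY_IDS (skipGetIds not used):
    // reads ids for (rows + start) * shards documents, plus required
    // fields for the final "rows" documents.
    static int idRelevanceReads(int rows, int start, int shards) {
        return rows + ((rows + start) * shards);
    }

    public static void main(String[] args) {
        int rows = 1000, start = 0, shards = 200;
        System.out.println(limitedRowsReads(rows, start, shards));  // 1000
        System.out.println(idRelevanceReads(rows, start, shards));  // 201000
    }
}
```

With start=0 the gap is entirely driven by the shard count: the first DQA touches the store for only the final "rows" documents, while the second touches it for "rows" ids on every shard.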


> Faster searching limited but high rows across many shards all with many hits
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-6810
>                 URL: https://issues.apache.org/jira/browse/SOLR-6810
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>            Reporter: Per Steffensen
>            Assignee: Shalin Shekhar Mangar
>              Labels: distributed_search, performance
>         Attachments: branch_5x_rev1642874.patch, branch_5x_rev1642874.patch, branch_5x_rev1645549.patch
>
>
> Searching "limited but high rows across many shards all with many hits" is slow
> E.g.
> * Query from outside client: q=something&rows=1000
> * Resulting in sub-requests to each shard, something like this:
> ** 1) q=something&rows=1000&fl=id,score
> ** 2) Request the full documents with ids in the global-top-1000 found among the top-1000 from each shard
> What does the subject mean?
> * "limited but high rows" means 1000 in the example above
> * "many shards" means 200-1000 in our case
> * "all with many hits" means that each of the shards has a significant number of hits on the query
> The problem grows on all three factors above
> Doing such a query on our system takes between 5 minutes and 1 hour - depending on a lot of things. It ought to be much faster, so let's make it faster.
> Profiling shows that the problem is that it takes lots of time to access the store to get ids for (up to) 1000 docs (value of rows parameter) per shard. With 1000 shards, up to 1 million ids have to be fetched. There is really no good reason to ever read information from store for more than the overall top-1000 documents that have to be returned to the client.
> For further detail see mail-thread "Slow searching limited but high rows across many shards all with high hits" started 13/11-2014 on dev@lucene.apache.org
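The scale described above can be verified with a quick count. This is a hypothetical sketch (names and the 1000-shard figure are taken from the example in the description, not from Solr code):

```java
// Quick count behind the claim above: with rows=1000 and 1000 shards,
// the first sub-request phase fetches ids from the store for up to
// rows * shards documents, although only the global top "rows" documents
// are ever returned to the client.
public class StoreReadCount {
    public static void main(String[] args) {
        int rows = 1000;
        int shards = 1000;
        long idsFetched = (long) rows * shards;  // up to 1,000,000 ids read from store
        long actuallyNeeded = rows;              // only the global top-1000 matter
        System.out.println(idsFetched);      // 1000000
        System.out.println(actuallyNeeded);  // 1000
    }
}
```

That thousand-fold gap between ids fetched and documents returned is the overhead the proposed DQA avoids.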



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

