lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-8907) add features to MiniSolrCloudCluster to make shard/leader/replica placement more reproducible
Date Fri, 25 Mar 2016 22:23:25 GMT

    [ https://issues.apache.org/jira/browse/SOLR-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212498#comment-15212498 ]

Hoss Man commented on SOLR-8907:
--------------------------------


The motivation for creating this issue came out of a situation I noticed while working on
SOLR-445.

The goal was to test that updates were working reliably regardless of which node they were
routed to.

The test, in a nutshell, looked like this...

{code}
// tests setup...
cluster.createCollection(...);
CLOUD_CLIENT = cluster.getSolrClient();
NODE_CLIENTS = new ArrayList<SolrClient>(numServers);
for (JettySolrRunner jetty : cluster.getJettySolrRunners()) {
  URL jettyURL = jetty.getBaseUrl();
  NODE_CLIENTS.add(new HttpSolrClient(jettyURL.toString() + "/" + COLLECTION_NAME + "/"));
}


// in a loop: randomly pick an update request, and a client to send it with
// (either the cloud client, or the HttpSolrClient of a randomly chosen node)...
SolrRequest req = makeRandomUpdateRequest(random());
SolrClient client = random().nextBoolean() ? CLOUD_CLIENT
    : NODE_CLIENTS.get(TestUtil.nextInt(random(), 0, NODE_CLIENTS.size() - 1));
assertSomeStuffAboutResponse(req.process(client));
{code}

There was a bug in the code such that, in some specific situations (depending on the output of
{{makeRandomUpdateRequest(...)}}), updates meeting certain criteria would fail _unless_ they
were sent to the leader of one particular shard (particular because it was the leader for all
of the ids generated by {{makeRandomUpdateRequest(...)}} in that loop iteration).

This meant that there were particular seeds that would reliably reproduce the failure _most
of the time_, but roughly {{1 / numServers}} of the time the leader for the shard in question
would randomly be assigned to the jetty instance whose {{HttpSolrClient}} was (randomly, but
consistently for that seed) being selected at that point -- and the failure wouldn't reproduce
at all.

That made the test far more confusing to try and debug than if the leaders for the shards
were consistently assigned to the same jetty nodes (relative to their ordering in the
list returned by {{cluster.getJettySolrRunners()}}) ... the way the older, pre-cloud,
distributed update tests used to work.

In short: given a fixed seed, the test code was doing everything in its power to be 100%
consistent w/ the requests it generated and the jetty nodes those requests were sent to --
but the test still wasn't very reproducible, because the shard & leader assignments were
random.

----

I suspect that the best way to try and implement something like this would be to use the [rule
based replica placement|https://cwiki.apache.org/confluence/display/solr/Rule-based+Replica+Placement]
feature -- perhaps with a special "Snitch" designed for use in MiniSolrCloudCluster tests?
... But I'm not really sure how it would work, because I don't really understand how to use
/ extend that feature.
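
To make that concrete: my (possibly wrong) understanding is that a test would pass placement
rules and a snitch to the {{CREATE}} call along these lines -- the rule shown and the snitch
class name below are just placeholders for illustration, not working code...

{code}
// rough, untested sketch -- the collection/config names and the snitch class are placeholders
ModifiableSolrParams params = new ModifiableSolrParams();
params.set("action", "CREATE");
params.set("name", "test_collection");
params.set("collection.configName", "conf1");
params.set("numShards", "2");
params.set("replicationFactor", "2");
// documented rule syntax: keep fewer than 2 replicas of any given shard on any one node
params.add("rule", "shard:*,replica:<2,node:*");
// hypothetical test-only snitch that would supply whatever tags the rules need
params.set("snitch", "class:org.apache.solr.cloud.MiniClusterSnitch");
QueryRequest createReq = new QueryRequest(params);
createReq.setPath("/admin/collections");
cluster.getSolrClient().request(createReq);
{code}

...but what tags a test-only snitch would need to expose -- and whether they could be based on
anything more deterministic than the randomly assigned ports -- is exactly the part I don't
understand.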


So, assuming for the sake of argument that it's not possible using the rule based placement
stuff, here's a description of the approach that initially occurred to me, to serve as a straw
man for discussion...

* If it's not already, {{MiniSolrCloudCluster}} should ensure every Jetty instance is started
up with a consistent node name (sequentially numbered or whatever)
* If it's not already, {{MiniSolrCloudCluster.getJettySolrRunners()}} should return the jetty
instances in a consistently sorted order (based on something like node name -- not something
non-deterministic like the port#, or the order in which they started up)
* {{MiniSolrCloudCluster.createCollection(...)}} (or some new method with a similar signature)
should be changed to do explicitly a lot of the work currently done implicitly by the {{CREATE}}
API call (see the code sketch after this list)...
** use the {{shards}} param to provide explicitly generated names for every shard
** use the {{createNodeSet=EMPTY}} param
** Once the collection is created (w/o any replicas)...
*** {{ADDREPLICA}} and {{ADDREPLICAPROP}} should be used explicitly to create a preferredLeader
for each (named) {{shard}} and assign it to a predictably chosen {{node}} (by name).
*** Additional {{ADDREPLICA}} calls should then be made as needed to add the expected number
of replicas for each {{shard}} on predictably chosen {{node}}s (by name).
* {{MiniSolrCloudCluster}} could then support some new convenience methods for tests to use:
** Things like...
*** {{List<HttpSolrClient> getClientsForAllReplicas(String collectionName)}}
*** {{List<HttpSolrClient> getClientsForShard(String collectionName, String shardName)}}
*** {{SortedMap<String,HttpSolrClient> getClientsForLeaders(String collectionName) //
keyed by shardName}}
*** {{HttpSolrClient getClientForLeader(String collectionName, String shardName)}}
** These methods should do a "live" lookup of the data currently in ZK, so that even if a test
shuts down nodes, or adds replicas, or triggers some bit of chaos, it can still subsequently
look up a useful SolrClient to test some action with (a rough sketch of that kind of lookup is
included at the end of this comment)
** Obviously these methods should return all clients in a consistent order (ie: sorted by core
node name)
** (See {{TestTolerantUpdateProcessorCloud.createMiniSolrCloudCluster()}} for some sample
code of building up SolrClients targeting shard leaders)
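
Here's the rough shape of the {{CREATE}} / {{ADDREPLICA}} / {{ADDREPLICAPROP}} sequence
sketched in the list above -- untested, and every collection/shard/node/core name below is
just a placeholder...

{code}
// rough, untested sketch of the explicit creation sequence -- all names are placeholders
SolrClient admin = cluster.getSolrClient();

// 1) CREATE the collection with no replicas at all.  (Note: explicit shard names via the
//    "shards" param require router.name=implicit; with the default compositeId router,
//    numShards=N produces the predictable names shard1..shardN anyway.)
ModifiableSolrParams create = new ModifiableSolrParams();
create.set("action", "CREATE");
create.set("name", "test_collection");
create.set("collection.configName", "conf1");
create.set("numShards", "2");
create.set("createNodeSet", "EMPTY");
QueryRequest createReq = new QueryRequest(create);
createReq.setPath("/admin/collections");
admin.request(createReq);

// 2) ADDREPLICA: put the first replica of each shard on a predictably chosen node
ModifiableSolrParams addReplica = new ModifiableSolrParams();
addReplica.set("action", "ADDREPLICA");
addReplica.set("collection", "test_collection");
addReplica.set("shard", "shard1");
addReplica.set("node", "127.0.0.1:10001_solr"); // ie: a deterministic node_name
QueryRequest addReplicaReq = new QueryRequest(addReplica);
addReplicaReq.setPath("/admin/collections");
admin.request(addReplicaReq);

// 3) ADDREPLICAPROP: mark that replica as the preferredLeader of its shard
ModifiableSolrParams addProp = new ModifiableSolrParams();
addProp.set("action", "ADDREPLICAPROP");
addProp.set("collection", "test_collection");
addProp.set("shard", "shard1");
addProp.set("replica", "core_node1"); // core node name of the replica added in (2)
addProp.set("property", "preferredLeader");
addProp.set("property.value", "true");
QueryRequest addPropReq = new QueryRequest(addProp);
addPropReq.setPath("/admin/collections");
admin.request(addPropReq);

// ...repeat (2) and (3) for each shard, then additional ADDREPLICA calls for the
// remaining replicas of each shard on predictably chosen nodes.
{code}

If I understand correctly, the first replica added to each (empty) shard becomes its leader,
so doing the {{ADDREPLICA}} for the intended leader first should give us the deterministic
leader placement, and marking it preferredLeader means {{REBALANCELEADERS}} can be used to
restore that arrangement later if leadership moves.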



...what do folks think?

is this possible/easy using a custom "snitch" ?
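
(And for the record: for the "live lookup" convenience methods described above, something
along these lines is what I'm picturing under the hood -- again untested, just the existing
ZkStateReader / ClusterState APIs...)

{code}
// rough, untested sketch of a "live" leader lookup, keyed by shard name
// (assuming this method lives inside MiniSolrCloudCluster itself)
public SortedMap<String,HttpSolrClient> getClientsForLeaders(String collectionName) {
  CloudSolrClient cloudClient = getSolrClient();
  ClusterState clusterState = cloudClient.getZkStateReader().getClusterState();
  DocCollection docCollection = clusterState.getCollection(collectionName);
  SortedMap<String,HttpSolrClient> leaders = new TreeMap<>();
  for (Slice slice : docCollection.getSlices()) {
    Replica leader = slice.getLeader();
    // NOTE: a real impl would need to deal with (or wait out) shards w/o a current leader
    String coreUrl = new ZkCoreNodeProps(leader).getCoreUrl();
    leaders.put(slice.getName(), new HttpSolrClient(coreUrl));
  }
  return leaders;
}
{code}

(Whether the cluster should cache & close those clients itself, or leave that to the test, is
an open question.)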

> add features to MiniSolrCloudCluster to make shard/leader/replica placement more reproducible
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-8907
>                 URL: https://issues.apache.org/jira/browse/SOLR-8907
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Hoss Man
>
> I think MiniSolrCloudCluster would be greatly improved if (by default) collections created
for test purposes had predictable shard/leader/core assignment across the jetty instances
that are spun up.  Even though the port#s used by the jettys will obviously vary every time
a test is run, ideally a given seed should ensure that the following are all consistent:
> * the node_name used by each JettySolrRunner
> * which nodes host which shards
> * the core names used on each jetty instance
> * which core is the leader for each shard
> Obviously this wouldn't make sense for tests where the entire purpose is to ensure that
the automatic assignment of these things works properly when creating a collection, or when
explicitly testing things like "preferredLeader", but for tests of non-collection API related
features (ie: update requests, search requests, sorting, etc...) where the test setup already
takes advantage of methods like {{MiniSolrCloudCluster.createCollection(...)}} as a shortcut
for using the API directly, this type of consistency would make potential test failures
a lot more reproducible && easier to diagnose.


