lucene-dev mailing list archives

From Jan Høydahl <jan....@cominvent.com>
Subject Re: [jira] [Commented] (SOLR-13464) Sporadic Auth + Cloud test failures, probably due to lag in nodes reloading security config
Date Fri, 31 May 2019 20:49:11 GMT
Hoss, I see several of these failures popping up, probably related to the timing of the config
reload across nodes. Should we, as a phase 1, introduce a simple sleep to harden those tests,
and follow up later with APIs that support waiting until the config propagates?
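
For illustration, phase 1 could look something like this untested sketch rather than a bare
Thread.sleep(); makeAuthProbe() is just a hypothetical stand-in for whatever request a given
test already asserts on:

    import java.util.concurrent.TimeUnit;

    public class SecurityPropagationTestUtil {

      // Hypothetical stand-in for the test's own probe, e.g. a request that should
      // start succeeding (or failing) once the new security settings are active.
      static boolean makeAuthProbe() throws Exception {
        return false;
      }

      // Poll until the probe observes the new settings, or give up after timeoutMs.
      static void waitForSecurityPropagation(long timeoutMs) throws Exception {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (System.nanoTime() < deadline) {
          if (makeAuthProbe()) return; // new settings observed on this node
          Thread.sleep(100);           // brief back-off before retrying
        }
        throw new AssertionError("security config did not propagate within " + timeoutMs + " ms");
      }
    }

That at least bounds the wait instead of hard-coding a sleep length.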

Jan Høydahl

> On 11 May 2019, at 01:46, Hoss Man (JIRA) <jira@apache.org> wrote:
> 
> 
>    [ https://issues.apache.org/jira/browse/SOLR-13464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837697#comment-16837697 ]
> 
> Hoss Man commented on SOLR-13464:
> ---------------------------------
> 
> In theory it would be possible for a test client (or any real production client) to poll
> {{/admin/auth...}} on all/any nodes in a cluster to verify that they are using the updated
> security settings, because the behavior of SecurityConfHandlerZk on GET is to read the
> _cached_ security props from the ZkStateReader, so in theory it's only updated once it's
> been force-refreshed by the zk watcher ... but this still has 2 problems:
> # any client doing this would have to be stateful and know what the most recent setting(s)
> change was, so it could assert that those specific settings have been updated. There's no
> way for a "dumb" client to simply ask "is your current config up to date w/ ZK?". Even if
> the client directly polled ZK to see what the current version is in the authoritative
> {{/security.json}} for the cluster, the "version" info isn't included in the
> {{GET /admin/auth...}} responses, so it would have to do a "deep comparison" of the entire
> JSON response (see the sketch after this list).
> # even if a client knows what data to expect from a {{GET /admin/auth...}} request when
> polling all/any nodes in the cluster (either from first-hand knowledge because it was the
> client that did the last POST, or second-hand knowledge from querying ZK directly), and
> even if the expected data is returned by every node, that doesn't mean it's in *USE* yet –
> there is inherent lag between when the security conf data is "refreshed" in the
> ZkStateReader (on each node) and when the plugin Object instances are actually initialized
> and become active on each node.
> 
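> As a rough illustration of problem 1, the "deep comparison" a dumb client is stuck with
> today would look something like the sketch below. This is not real test code:
> SolrZkClient and Utils are the real Solr utilities, but the class, its methods, and
> httpGetBody() are hypothetical:
> 
>     import java.net.URI;
>     import java.net.http.HttpClient;
>     import java.net.http.HttpRequest;
>     import java.net.http.HttpResponse;
>     import java.util.Map;
>     import java.util.Objects;
>     import org.apache.solr.common.cloud.SolrZkClient;
>     import org.apache.solr.common.util.Utils;
> 
>     // Illustrative helper class; nothing here is an existing Solr API.
>     class SecurityConfigProbe {
> 
>       // Compare the authoritative /security.json in ZK against what one node reports,
>       // since GET /admin/auth... currently exposes no version information.
>       @SuppressWarnings("unchecked")
>       static boolean securityLooksUpToDate(SolrZkClient zkClient, String nodeBaseUrl) throws Exception {
>         byte[] zkData = zkClient.getData("/security.json", null, null, true); // authoritative copy
>         Map<String, Object> expected = (Map<String, Object>) Utils.fromJSON(zkData);
>         Map<String, Object> actual = (Map<String, Object>)
>             Utils.fromJSONString(httpGetBody(nodeBaseUrl + "/admin/authentication"));
>         return Objects.equals(expected.get("authentication"), actual.get("authentication"));
>       }
> 
>       // Hypothetical helper: fetch a URL body as a String (JDK 11+ java.net.http).
>       static String httpGetBody(String url) throws Exception {
>         HttpClient client = HttpClient.newHttpClient();
>         HttpRequest req = HttpRequest.newBuilder(URI.create(url)).build();
>         return client.send(req, HttpResponse.BodyHandlers.ofString()).body();
>       }
>     }
> 
> ...and per problem 2, even when this returns true, the new plugin instances may not be
> initialized yet.
> 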
> ----
> Here's a strawman proposal for a possible solution to this problem – both for use in tests,
> and for end users who might want to verify when updated settings are really enabled...
> # refactor CoreContainer so that methods like {{public AuthorizationPlugin
> getAuthorizationPlugin()}} are deprecated/syntactic sugar for new
> {{public SecurityPluginHolder<AuthorizationPlugin> getAuthorizationPlugin()}} methods, so
> that callers can read the znode version used to init the plugin
> # refactor {{SecurityConfHandler.getPlugin(String)}} to be deprecated/syntactic sugar for
> a new version that returns {{SecurityPluginHolder<?>}}
> # update {{SecurityConfHandlerZk.getConf}} so that it:
> ** uses {{getSecurityConfig(true)}} to ensure it reads the most current settings from ZK
> (instead of the cached copy used by the current code)
> ** adds the {{SecurityConfig.getVersion()}} number in the response (in addition to the
> config data) ... perhaps as {{key + ".conf.version"}}
> ** when {{getPlugin(key)}} is non-null, includes the {{SecurityPluginHolder.getVersion()}}
> in the response ... perhaps as {{key + ".enabled.version"}}
> 
> ...that way a dumb client can easily poll any/all node(s) for {{/admin/auth_foo}} until
> the {{auth_foo.conf.version}} and {{auth_foo.enabled.version}} are identical, to know when
> the most recent {{auth_foo}} settings in ZK's security.json are actually in use (sketch
> below).
> 
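> A minimal sketch of that polling loop, assuming the strawman {{*.conf.version}} /
> {{*.enabled.version}} keys above (httpGetBody() as in the earlier sketch; none of this is
> an existing API):
> 
>     import java.util.Map;
>     import java.util.concurrent.TimeUnit;
>     import org.apache.solr.common.util.Utils;
> 
>     // Poll one node until the version the active plugin was initialized with
>     // catches up with the config version the node has seen from ZK.
>     @SuppressWarnings("unchecked")
>     static void waitUntilAuthConfigActive(String nodeBaseUrl, long timeoutMs) throws Exception {
>       long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
>       while (System.nanoTime() < deadline) {
>         Map<String, Object> rsp = (Map<String, Object>)
>             Utils.fromJSONString(httpGetBody(nodeBaseUrl + "/admin/authentication"));
>         Object confVersion = rsp.get("authentication.conf.version");       // seen from ZK
>         Object enabledVersion = rsp.get("authentication.enabled.version"); // init'ed into plugin
>         if (confVersion != null && confVersion.equals(enabledVersion)) return; // in use
>         Thread.sleep(100);
>       }
>       throw new IllegalStateException("auth config not active within " + timeoutMs + " ms");
>     }
> 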
> (We could potentially take things even a step further, and add something like a
> {{verify.cluster.version=true|false}} option to SecurityConfHandlerZk, that would federate
> {{GET /admin/auth...}} to every (live?) node in the cluster, and include a map of
> nodeName => enabled.version for every node ... maybe?)
> 
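> (Hypothetical response shape, just to make that concrete: something like
> {{"enabled.versions": {"node1:8983_solr": 7, "node2:8983_solr": 6}}} in the response, in
> addition to the per-node payload.)
> 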
> Thoughts?
> 
>> Sporadic Auth + Cloud test failures, probably due to lag in nodes reloading security config
>> -------------------------------------------------------------------------------------------
>> 
>>                Key: SOLR-13464
>>                URL: https://issues.apache.org/jira/browse/SOLR-13464
>>            Project: Solr
>>         Issue Type: Bug
>>     Security Level: Public (Default Security Level. Issues are Public)
>>           Reporter: Hoss Man
>>           Priority: Major
>> 
>> I've been investigating some sporadic and hard-to-reproduce test failures related to
>> authentication in cloud mode, and I *think* (but have not directly verified) that the
>> common cause is that after one uses one of the {{/admin/auth...}} handlers to update some
>> setting, there is an inherent and unpredictable delay (due to ZK watches) until every node
>> in the cluster has had a chance to (re)load the new configuration and initialize the
>> various security plugins with the new settings.
>> Which means, if a test client does a POST to some node to add/change/remove some
>> authn/authz settings, and then immediately hits the exact same node (or any other node) to
>> test that the effects of those settings exist, there is no guarantee that they will have
>> taken effect yet.
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

