phoenix-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrew Purtell <andrew.purt...@gmail.com>
Subject Re: Time for 5.1.2?
Date Mon, 31 May 2021 17:13:02 GMT
The design is a product of its time and is due for an update. It would be better if the admin
APIs triggered and monitored a procedure that established a barrier at the RS cache update,
so a grant or revoke becomes a synchronous operation and the client knows when the call returns
the propagation of the updates are complete. However this is open source and we are reliant
on volunteers to do the work, and if it isn’t critical to someone or easy to do it is unlikely
to happen soon.


> On May 31, 2021, at 10:02 AM, Viraj Jasani <vjasani@apache.org> wrote:
> 
> 
>> 
>> This stuff was designed in years before the Procedure framework was
> written. What we had to work with is ZK watchers.
> 
> I see, this aligns with what I got to know today about the workflow. Yes,
> w.r.t ZK watchers, there is no change b/ HBase 1 and 2. Thanks Andrew for
> the clarification!
> 
>> The propagation of cache updates could be made synchronous with the
> update itself if the grant and revoke admin apis were refactored as
> Procedures. The Procedure would have a sub-procedure that establishes a
> barrier around the cache refresh at all live regionservers, presumably.
> There is a JIRA for that work filed somewhere, an old one.
> 
> Yes, I saw the Jira as well today, HBASE-21602. Good to see significant no
> of sub-tasks are already completed.
> But overall I understand that since this is more of an operator initiated
> action, I don't think there is a problem with existing design. The only
> thing I wanted to make sure, HBase 1 and 2 don't have any behavioural
> change in the workflow (other than HBase 1 using coproc endpoint directly
> on RS vs HBase 2 using Admin APIs).
> 
>> On Mon, May 31, 2021 at 10:07 PM Andrew Purtell <andrew.purtell@gmail.com>
>> wrote:
>> 
>> In the HBase security unit tests we wait for the zk mediated cache reload
>> to complete before continuing the test. If you don’t do that your test will
>> be flaky. Check that you are doing this. If you are not, then your tests
>> will be flaky. In the real world, the operator issues the grant or revoke
>> command and we have the human time scale of ~seconds (at least) for the
>> change to propagate. This is not a bug. This stuff was designed in years
>> before the Procedure framework was written. What we had to work with is ZK
>> watchers. Those are processed asynchronously with respect to the admin API.
>> Because a unit test runs much faster than a human administrative act, you
>> have to wait for more than just the admin API to complete. The cache in
>> each regionserver’s access controller has a last update timestamp and you
>> should wait for them to all update. Look at the HBase unit tests to see how
>> that is done.
>> 
>> The propagation of cache updates could be made synchronous with the update
>> itself if the grant and revoke admin apis were refactored as Procedures.
>> The Procedure would have a sub-procedure that establishes a barrier around
>> the cache refresh at all live regionservers, presumably. There is a JIRA
>> for that work filed somewhere, an old one. It’s not been important enough
>> for someone to pick up and perform. Don’t expect this to change.
>> 
>> 
>>>> On May 31, 2021, at 6:52 AM, Viraj Jasani <vjasani@apache.org> wrote:
>>> 
>>> For Permission APIs grant/revoke, I just checked the diff b/ HBase 1 and
>>> HBase 2.
>>> Phoenix MetaDataClient uses AccessControlClient to grant/revoke
>> permission
>>> to a user. HBase 1 AccessControlClient invokes coproc
>>> endpoint AccessController, which puts records in hbase:acl table on
>>> RegionServer hosting acl's single region. HBase 2 AccessControlClient on
>>> the other hand invokes Admin APIs to let master take care of adding
>> records
>>> in hbase:acl table (HBASE-21739). This is the main diff b/ both versions,
>>> however, both workflows seem synchronous.
>>> For both HBase 1 and 2, post-put hook updates ZNodes with updated entries
>>> in acl table, which internally triggers a refresh of table cache in
>>> AuthManager, and this table cache could be used to verify Auth of user on
>>> table/namespace in subsequent request (based on our test case: remove
>>> access, followed by verification of grant permission). If table cache has
>>> race condition, it should also be applicable to HBase 1 as well I
>> believe,
>>> yet to compare both at length.
>>> 
>>> I understand, Phoenix is not doing anything different w.r.t GRANT/REVOKE
>>> statements so some sync issue might have been introduced in HBase 2
>>> unknowingly. I will try to dig more sometime next week and file HBase
>> Jira
>>> if I could find some evidence of race conditions or could reproduce with
>>> HBase tests directly.
>>> 
>>>> On Mon, May 31, 2021 at 11:40 AM Istvan Toth <stoty@apache.org> wrote:
>>>> 
>>>> I agree that we are ready for the RC, and that deflaking the tests is
>>>> great.
>>>> 
>>>> However, I think that Permissions test failures are exposing a real
>> problem
>>>> (probably in HBase),
>>>> and we've only adopted the tests for the buggy behaviour of HBase, and
>>>> should track down / fix
>>>> the root cause.
>>>> 
>>>> On Sun, May 30, 2021 at 4:46 PM Viraj Jasani <vjasani@apache.org>
>> wrote:
>>>> 
>>>>>> IMHO, a flapper is worse than a failing test because it undermines
the
>>>>> confidence in a test run.
>>>>> 
>>>>> I concur. In fact, in the last few builds, I have seen a few flappers
>>>> (e.g
>>>>> testUpsertSelectWithMultiByteCharsAutoCommit, testAsyncRebuildAll)
>> whose
>>>>> resolution was attempted just a few months back and now they are
>> showing
>>>> up
>>>>> again (with much lower frequency than before), so yes we need some
>>>>> attention again.
>>>>> 
>>>>> After fixing consistent failure with testPherfMain (PHOENIX-6482) and
>>>> high
>>>>> frequency flappers with BasePermissionsIT and AuditLoggingIT
>>>>> (PHOENIX-6483), CI builds are in better shape now.
>>>>> In the recent build, I can see no test failures with 3/4 HBase
>> profiles.
>>>> I
>>>>> will file a Jira for JaCoCo report generation failure, which keeps
>>>> showing
>>>>> up often (flapper with JaCoCo issue, not related to test failure).
>>>>> 
>>>>> Given that CI builds are in better shape, planning to start an RC soon.
>>>>> Thanks
>>>>> 
>>>>> 
>>>>> On Sat, May 29, 2021 at 9:45 PM larsh@apache.org <larsh@apache.org>
>>>> wrote:
>>>>> 
>>>>>> Nice!
>>>>>> 
>>>>>> I noticed we had some successful 5.1 runs too.
>>>>>> Fixing the tests would need some attention again. Flappers should
>>>> either
>>>>>> be fixed or disabled.
>>>>>> 
>>>>>> IMHO, a flapper is worse than a failing test because it undermines
the
>>>>>> confidence in a test run.
>>>>>> If the tests always pass you'll notice a failure right away.
>>>>>> But when most of the runs produce some kind of failure it is human
>>>> nature
>>>>>> to start ignoring them, and then new, real failures pile up.
>>>>>> 
>>>>>> 
>>>>>> -- Lars
>>>>>> 
>>>>>> 
>>>>>> On Friday, May 28, 2021, 11:36:53 PM PDT, Viraj Jasani <
>>>>> vjasani@apache.org>
>>>>>> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Due to incorrect/missing fix versions, found some discrepancies b/
>>>> master
>>>>>> and 5.1 in the Pherf module.
>>>>>> Backported recent improvements to 5.1 including test fix:
>>>>>> PHOENIX-6417, PHOENIX-6118, PHOENIX-6430, PHOENIX-6429, PHOENIX-6431,
>>>>>> PHOENIX-6432
>>>>>> 
>>>>>> On Sat, May 29, 2021 at 12:20 AM Viraj Jasani <vjasani@apache.org>
>>>>> wrote:
>>>>>> 
>>>>>>> I looked at CI build today and there are a couple of flakies
in
>>>>>> core/pherf.
>>>>>>> 
>>>>>>> 3 major flaky tests: PermissionNSEnabledIT, PermissionsCacheIT,
>>>>>>> AuditLoggingIT (reported with HBase 2.2/2.3/2.4). Permission
tests
>>>> have
>>>>>>> been flaky for quite some time. IIRC, there was some attempt
to fix
>>>>> them
>>>>>> as
>>>>>>> well but still they appear with AccessDeniedException at times.
Not
>>>>> much
>>>>>>> sure about AuditLoggingIT though.
>>>>>>> 
>>>>>>> Pherf test fails with HBase 2.1.
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, 28 May 2021 at 10:44 PM, larsh@apache.org <larsh@apache.org>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> 5.1 CI builds seems to fail consistently - at least for the
past two
>>>>>>>> weeks.
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> https://ci-hadoop.apache.org/job/Phoenix/job/Phoenix-mulitbranch/job/5.1/
>>>>>>>> 
>>>>>>>> Might be something to look into.
>>>>>>>> I have an hour today between meetings, I'll see what I can
do. :)
>>>>>>>> 
>>>>>>>> -- Lars
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wednesday, May 26, 2021, 10:54:13 PM PDT, Viraj Jasani
<
>>>>>>>> vjasani@apache.org> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks Lars. I had an offline chat with Istvan yesterday
and I will
>>>>>>>> prepare
>>>>>>>> RCs over the weekend.
>>>>>>>> 
>>>>>>>> In the meantime, I will again take a look at Jira fixVersion
and git
>>>>>>>> commits compatibility and also try to help with any backport
(if
>>>> still
>>>>>>>> pending).
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, 26 May 2021 at 11:53 PM, larsh@apache.org <larsh@apache.org
>>>>> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> I cherry-picked three changes into branch 5.1.
>>>>>>>>> Assuming Jira is up-to-date, there are no 4.16.x issues
that are
>>>> not
>>>>>>>> also
>>>>>>>>> in 5.1.x.
>>>>>>>>> 
>>>>>>>>> There are 12 issues left that are in 4.17.0 and 5.2.0,
but not in
>>>>>> 5.1.x.
>>>>>>>>> But they are presumably not in 4.16.x because they violate
>>>> backward
>>>>>>>>> compatibility, and - if so - should not be in 5.1.2.
>>>>>>>>> 
>>>>>>>>> -- Lars
>>>>>>>>> 
>>>>>>>>> On Wednesday, May 26, 2021, 10:15:31 AM PDT, larsh@apache.org
<
>>>>>>>>> larsh@apache.org> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Awesome. Thanks Istvan.
>>>>>>>>> 
>>>>>>>>> I'll do a pass through the issues too.
>>>>>>>>> 
>>>>>>>>> -- Lars
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tuesday, May 25, 2021, 8:45:00 PM PDT, Istvan Toth
<
>>>>>> stoty@apache.org
>>>>>>>>> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I am not aware of any blockers in 5.1.
>>>>>>>>> If there is no objection, and no other volunteers for
the 5.1.2 RM
>>>>>>>> role, I
>>>>>>>>> intend to start the process on Monday.
>>>>>>>>> 
>>>>>>>>> I will also look for missing fixes before release, but
if anyone
>>>> is
>>>>>>>> aware
>>>>>>>>> of any fix that is needed in 5.1.2 , but
>>>>>>>>> hasn't been backported, then please try to get it resolved
by
>>>>> Monday.
>>>>>>>>> 
>>>>>>>>> regards
>>>>>>>>> Istvan
>>>>>>>>> 
>>>>>>>>> On Wed, May 26, 2021 at 4:19 AM larsh@apache.org <
>>>> larsh@apache.org>
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> There are 17 resolved issues against it.
>>>>>>>>>> PHOENIX-6436 is needed for further integration with
Trino
>>>>> (formerly
>>>>>>>>>> Presto).
>>>>>>>>>> 
>>>>>>>>>> It is time for another point release?
>>>>>>>>>> 
>>>>>>>>>> -- Lars
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 

Mime
View raw message