phoenix-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrew Purtell <andrew.purt...@gmail.com>
Subject Re: Time for 5.1.2?
Date Mon, 31 May 2021 16:37:34 GMT
In the HBase security unit tests we wait for the zk mediated cache reload to complete before
continuing the test. If you don’t do that your test will be flaky. Check that you are doing
this. If you are not, then your tests will be flaky. In the real world, the operator issues
the grant or revoke command and we have the human time scale of ~seconds (at least) for the
change to propagate. This is not a bug. This stuff was designed in years before the Procedure
framework was written. What we had to work with is ZK watchers. Those are processed asynchronously
with respect to the admin API. Because a unit test runs much faster than a human administrative
act, you have to wait for more than just the admin API to complete. The cache in each regionserver’s
access controller has a last update timestamp and you should wait for them to all update.
Look at the HBase unit tests to see how that is done. 

The propagation of cache updates could be made synchronous with the update itself if the grant
and revoke admin apis were refactored as Procedures. The Procedure would have a sub-procedure
that establishes a barrier around the cache refresh at all live regionservers, presumably.
There is a JIRA for that work filed somewhere, an old one. It’s not been important enough
for someone to pick up and perform. Don’t expect this to change.


> On May 31, 2021, at 6:52 AM, Viraj Jasani <vjasani@apache.org> wrote:
> 
> For Permission APIs grant/revoke, I just checked the diff b/ HBase 1 and
> HBase 2.
> Phoenix MetaDataClient uses AccessControlClient to grant/revoke permission
> to a user. HBase 1 AccessControlClient invokes coproc
> endpoint AccessController, which puts records in hbase:acl table on
> RegionServer hosting acl's single region. HBase 2 AccessControlClient on
> the other hand invokes Admin APIs to let master take care of adding records
> in hbase:acl table (HBASE-21739). This is the main diff b/ both versions,
> however, both workflows seem synchronous.
> For both HBase 1 and 2, post-put hook updates ZNodes with updated entries
> in acl table, which internally triggers a refresh of table cache in
> AuthManager, and this table cache could be used to verify Auth of user on
> table/namespace in subsequent request (based on our test case: remove
> access, followed by verification of grant permission). If table cache has
> race condition, it should also be applicable to HBase 1 as well I believe,
> yet to compare both at length.
> 
> I understand, Phoenix is not doing anything different w.r.t GRANT/REVOKE
> statements so some sync issue might have been introduced in HBase 2
> unknowingly. I will try to dig more sometime next week and file HBase Jira
> if I could find some evidence of race conditions or could reproduce with
> HBase tests directly.
> 
>> On Mon, May 31, 2021 at 11:40 AM Istvan Toth <stoty@apache.org> wrote:
>> 
>> I agree that we are ready for the RC, and that deflaking the tests is
>> great.
>> 
>> However, I think that Permissions test failures are exposing a real problem
>> (probably in HBase),
>> and we've only adopted the tests for the buggy behaviour of HBase, and
>> should track down / fix
>> the root cause.
>> 
>> On Sun, May 30, 2021 at 4:46 PM Viraj Jasani <vjasani@apache.org> wrote:
>> 
>>>> IMHO, a flapper is worse than a failing test because it undermines the
>>> confidence in a test run.
>>> 
>>> I concur. In fact, in the last few builds, I have seen a few flappers
>> (e.g
>>> testUpsertSelectWithMultiByteCharsAutoCommit, testAsyncRebuildAll) whose
>>> resolution was attempted just a few months back and now they are showing
>> up
>>> again (with much lower frequency than before), so yes we need some
>>> attention again.
>>> 
>>> After fixing consistent failure with testPherfMain (PHOENIX-6482) and
>> high
>>> frequency flappers with BasePermissionsIT and AuditLoggingIT
>>> (PHOENIX-6483), CI builds are in better shape now.
>>> In the recent build, I can see no test failures with 3/4 HBase profiles.
>> I
>>> will file a Jira for JaCoCo report generation failure, which keeps
>> showing
>>> up often (flapper with JaCoCo issue, not related to test failure).
>>> 
>>> Given that CI builds are in better shape, planning to start an RC soon.
>>> Thanks
>>> 
>>> 
>>> On Sat, May 29, 2021 at 9:45 PM larsh@apache.org <larsh@apache.org>
>> wrote:
>>> 
>>>> Nice!
>>>> 
>>>> I noticed we had some successful 5.1 runs too.
>>>> Fixing the tests would need some attention again. Flappers should
>> either
>>>> be fixed or disabled.
>>>> 
>>>> IMHO, a flapper is worse than a failing test because it undermines the
>>>> confidence in a test run.
>>>> If the tests always pass you'll notice a failure right away.
>>>> But when most of the runs produce some kind of failure it is human
>> nature
>>>> to start ignoring them, and then new, real failures pile up.
>>>> 
>>>> 
>>>> -- Lars
>>>> 
>>>> 
>>>> On Friday, May 28, 2021, 11:36:53 PM PDT, Viraj Jasani <
>>> vjasani@apache.org>
>>>> wrote:
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Due to incorrect/missing fix versions, found some discrepancies b/
>> master
>>>> and 5.1 in the Pherf module.
>>>> Backported recent improvements to 5.1 including test fix:
>>>> PHOENIX-6417, PHOENIX-6118, PHOENIX-6430, PHOENIX-6429, PHOENIX-6431,
>>>> PHOENIX-6432
>>>> 
>>>> On Sat, May 29, 2021 at 12:20 AM Viraj Jasani <vjasani@apache.org>
>>> wrote:
>>>> 
>>>>> I looked at CI build today and there are a couple of flakies in
>>>> core/pherf.
>>>>> 
>>>>> 3 major flaky tests: PermissionNSEnabledIT, PermissionsCacheIT,
>>>>> AuditLoggingIT (reported with HBase 2.2/2.3/2.4). Permission tests
>> have
>>>>> been flaky for quite some time. IIRC, there was some attempt to fix
>>> them
>>>> as
>>>>> well but still they appear with AccessDeniedException at times. Not
>>> much
>>>>> sure about AuditLoggingIT though.
>>>>> 
>>>>> Pherf test fails with HBase 2.1.
>>>>> 
>>>>> 
>>>>> On Fri, 28 May 2021 at 10:44 PM, larsh@apache.org <larsh@apache.org>
>>>>> wrote:
>>>>> 
>>>>>> 5.1 CI builds seems to fail consistently - at least for the past
two
>>>>>> weeks.
>>>>>> 
>>>>>> 
>>>> 
>>> 
>> https://ci-hadoop.apache.org/job/Phoenix/job/Phoenix-mulitbranch/job/5.1/
>>>>>> 
>>>>>> Might be something to look into.
>>>>>> I have an hour today between meetings, I'll see what I can do. :)
>>>>>> 
>>>>>> -- Lars
>>>>>> 
>>>>>> 
>>>>>> On Wednesday, May 26, 2021, 10:54:13 PM PDT, Viraj Jasani <
>>>>>> vjasani@apache.org> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Thanks Lars. I had an offline chat with Istvan yesterday and I will
>>>>>> prepare
>>>>>> RCs over the weekend.
>>>>>> 
>>>>>> In the meantime, I will again take a look at Jira fixVersion and
git
>>>>>> commits compatibility and also try to help with any backport (if
>> still
>>>>>> pending).
>>>>>> 
>>>>>> 
>>>>>> On Wed, 26 May 2021 at 11:53 PM, larsh@apache.org <larsh@apache.org
>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> I cherry-picked three changes into branch 5.1.
>>>>>>> Assuming Jira is up-to-date, there are no 4.16.x issues that
are
>> not
>>>>>> also
>>>>>>> in 5.1.x.
>>>>>>> 
>>>>>>> There are 12 issues left that are in 4.17.0 and 5.2.0, but not
in
>>>> 5.1.x.
>>>>>>> But they are presumably not in 4.16.x because they violate
>> backward
>>>>>>> compatibility, and - if so - should not be in 5.1.2.
>>>>>>> 
>>>>>>> -- Lars
>>>>>>> 
>>>>>>> On Wednesday, May 26, 2021, 10:15:31 AM PDT, larsh@apache.org
<
>>>>>>> larsh@apache.org> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Awesome. Thanks Istvan.
>>>>>>> 
>>>>>>> I'll do a pass through the issues too.
>>>>>>> 
>>>>>>> -- Lars
>>>>>>> 
>>>>>>> 
>>>>>>> On Tuesday, May 25, 2021, 8:45:00 PM PDT, Istvan Toth <
>>>> stoty@apache.org
>>>>>>> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> I am not aware of any blockers in 5.1.
>>>>>>> If there is no objection, and no other volunteers for the 5.1.2
RM
>>>>>> role, I
>>>>>>> intend to start the process on Monday.
>>>>>>> 
>>>>>>> I will also look for missing fixes before release, but if anyone
>> is
>>>>>> aware
>>>>>>> of any fix that is needed in 5.1.2 , but
>>>>>>> hasn't been backported, then please try to get it resolved by
>>> Monday.
>>>>>>> 
>>>>>>> regards
>>>>>>> Istvan
>>>>>>> 
>>>>>>> On Wed, May 26, 2021 at 4:19 AM larsh@apache.org <
>> larsh@apache.org>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> There are 17 resolved issues against it.
>>>>>>>> PHOENIX-6436 is needed for further integration with Trino
>>> (formerly
>>>>>>>> Presto).
>>>>>>>> 
>>>>>>>> It is time for another point release?
>>>>>>>> 
>>>>>>>> -- Lars
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 

Mime
View raw message