hadoop-common-issues mailing list archives

From "Wei-Chiu Chuang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-16284) KMS Cache Miss Storm
Date Wed, 08 May 2019 01:59:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-16284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835245#comment-16835245
] 

Wei-Chiu Chuang commented on HADOOP-16284:
------------------------------------------

{quote}Do you know why the number of keys is relevant? Is the key cache evicting them due
to size or the accesses for a particular key are more distributed over time vs a few highly
contended keys?
{quote}
I don't manage the KMS key provider backend (CKTS) so I am afraid I can't offer the implementation
details. IIRC, the minimum latency we observed was around 100 ms (each KMS-to-CKTS connection
involves PGP computation and other overhead, so it tends to be slow). I am not sure whether
the latency is proportional to the number of encryption keys we have, but it is proportional
to the number of KMS instances, because the backend has a global write lock design and only
one request is allowed at a time.

We saw key provider latency go as high as 20 seconds per request during testing with 4
KMS instances. Consider an extreme case: you start the KMS cold and have many encryption
zones/keys; that is likely to trigger multiple cache misses consecutively immediately after restart.
In this case, we observed a KMS outage lasting several minutes after a KMS restart. Even after
the KMS stabilizes, some encryption keys are rarely used, and when they are, they trigger
cache misses from time to time.

!4 kms, no KTS patch.png!

Additionally, there is already a production workload on the KMS, and the KMS runs out of threads
easily. We actually saw "No content to map" exceptions despite very low CPU utilization, which
puzzled us at first.

> KMS Cache Miss Storm
> --------------------
>
>                 Key: HADOOP-16284
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16284
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: kms
>    Affects Versions: 2.6.0
>         Environment: CDH 5.13.1, Kerberized, Cloudera Keytrustee Server
>            Reporter: Wei-Chiu Chuang
>            Priority: Major
>         Attachments: 4 kms, no KTS patch.png
>
>
> We recently stumbled upon a performance issue with KMS, where it occasionally exhibited
a "No content to map" error (this cluster ran an old version that doesn't have HADOOP-14841)
and jobs crashed. *We bumped the number of KMSes from 2 to 4, and the situation got even worse.*
> Later, we realized this cluster had a few hundred encryption zones and a few hundred
encryption keys. This is pretty unusual, because most of the deployments known to us have at
most a dozen keys. So in terms of the number of keys, this cluster is 1-2 orders of magnitude
higher than anyone else's.
> The high number of encryption keys increases the likelihood of key cache misses in KMS.
In Cloudera's setup, each cache miss forces the KMS to sync with its backend, the Cloudera Keytrustee
Server. Plus, the high number of KMSes amplifies the latency, effectively causing a [cache
miss storm|https://en.wikipedia.org/wiki/Cache_stampede].
> We were able to reproduce this issue with KMS-o-meter (HDFS-14312) - I will surely come up
with a better name later - and discovered a scalability bug in CKTS. The fix was then verified
with the same tool.
> Filing this bug so the community is aware of the issue. I don't have a solution in KMS
for now, but we want to address this scalability problem in the near future, because we
are seeing use cases that require thousands of encryption keys.
> ----
> On a side note, 4 KMS instances don't work well without HADOOP-14445 (and subsequent fixes).
A MapReduce job acquires at most 3 KMS delegation tokens, so in cases such as distcp,
it would fail to reach the 4th KMS on the remote cluster. I imagine similar issues exist
for other execution engines, but I didn't test.
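
The cache miss storm described above is the classic stampede pattern: many concurrent misses for the same key each trigger their own slow backend fetch. A minimal sketch of one common mitigation - a per-key "single flight" cache where concurrent callers share one in-flight fetch - is below. This is illustrative Java under stated assumptions, not the actual Hadoop KMS or CKTS code; the `backend` function stands in for the slow key-provider call.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch only: a cache that collapses concurrent misses for the same key
// into a single backend fetch, so a cold start does not multiply slow
// key-provider round trips (the "cache miss storm" above).
public class SingleFlightCache<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> backend; // hypothetical slow key-provider lookup

    public SingleFlightCache(Function<K, V> backend) {
        this.backend = backend;
    }

    public V get(K key) {
        // computeIfAbsent creates at most one future per key; every other
        // caller that misses concurrently joins the same in-flight fetch
        // instead of issuing its own backend request.
        return cache.computeIfAbsent(key,
                k -> CompletableFuture.supplyAsync(() -> backend.apply(k))).join();
    }
}
```

Guava's `LoadingCache` and the KMS's own `ValueQueue` take broadly similar approaches; the point of the sketch is only that without some form of request collapsing, N concurrent misses become N serialized trips through the backend's global write lock.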



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

