hadoop-common-issues mailing list archives

From "Wei-Chiu Chuang (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HADOOP-16284) KMS Cache Miss Storm
Date Wed, 01 May 2019 16:16:00 GMT
Wei-Chiu Chuang created HADOOP-16284:

             Summary: KMS Cache Miss Storm
                 Key: HADOOP-16284
                 URL: https://issues.apache.org/jira/browse/HADOOP-16284
             Project: Hadoop Common
          Issue Type: Bug
          Components: kms
    Affects Versions: 2.6.0
         Environment: CDH 5.13.1, Kerberized, Cloudera Keytrustee Server
            Reporter: Wei-Chiu Chuang

We recently stumbled upon a performance issue with KMS, where it occasionally returned a "No
content to map" error (this cluster ran an old version that doesn't have HADOOP-14841) and
jobs crashed. *We bumped the number of KMSes from 2 to 4, and the situation got even worse.*

Later, we realized this cluster had a few hundred encryption zones and a few hundred encryption
keys. This is pretty unusual because most of the deployments known to us have at most a dozen
keys. So in terms of number of keys, this cluster is 1-2 orders of magnitude larger than any
other.

The high number of encryption keys increases the likelihood of a key cache miss in KMS. In
Cloudera's setup, each cache miss forces KMS to sync with its backend, the Cloudera Keytrustee
Server. On top of that, the higher number of KMSes amplifies the latency, effectively causing a [cache
miss storm|https://en.wikipedia.org/wiki/Cache_stampede].
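
To illustrate why adding KMS instances can raise backend load, here is a toy simulation in
Python. It is only a sketch under assumptions that are not stated in this report: each KMS
instance keeps its own independent cache, entries expire after a fixed TTL, and each request
picks a key and an instance uniformly at random. The function name and its parameters are
made up for illustration and do not correspond to any KMS or Keytrustee code.

```python
import random


def count_backend_syncs(num_keys, num_kms, num_requests, ttl, seed=0):
    """Count cache misses (backend syncs) in a toy model.

    Assumptions (hypothetical, not taken from the report): one request
    arrives per tick, every KMS instance has its own cache, and a cached
    key expires `ttl` ticks after it was loaded.
    """
    rng = random.Random(seed)
    # expires_at[i][key] = tick at which instance i's cached copy expires
    expires_at = [dict() for _ in range(num_kms)]
    misses = 0
    for tick in range(num_requests):
        kms = rng.randrange(num_kms)
        key = rng.randrange(num_keys)
        if expires_at[kms].get(key, -1) <= tick:
            # Cache miss: this instance has to sync with the backend.
            misses += 1
            expires_at[kms][key] = tick + ttl
    return misses
```

In this model the number of backend syncs grows with both the key count and the instance
count, because every instance has to load every key on its own. Comparing
count_backend_syncs(300, 2, 20000, 5000) with count_backend_syncs(300, 4, 20000, 5000)
gives more syncs in the 4-instance case for the same request stream length.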

We were able to reproduce this issue with KMS-o-meter (HDFS-14312) - I will come up with a
better name later surely - and discovered a scalability bug in CKTS. The fix was verified
again with the tool.

Filing this bug so the community is aware of this issue. I don't have a solution for now in
KMS. But we want to address this scalability problem in the near future because we are seeing
use cases that require thousands of encryption keys.

On a side note, 4 KMSes don't work well without HADOOP-14445 (and subsequent fixes). A MapReduce
job acquires at most 3 KMS delegation tokens, so in cases such as distcp, it would
fail to reach the 4th KMS on the remote cluster. I imagine similar issues exist for other
execution engines, but I didn't test.

