kudu-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ale...@apache.org
Subject [4/5] kudu git commit: KUDU-2044 Tombstoned tablets show up in /metrics
Date Tue, 17 Oct 2017 15:23:21 GMT
KUDU-2044 Tombstoned tablets show up in /metrics

This patch stops tombstoned tablets from showing up in /metrics. When a
tablet replica is shut down, the tablet's metric entity is marked as
unpublished. All its child metrics will be marked for retirement, then
retired after the retirement interval has passed. Then the entity
itself is unregistered, i.e. removed from the metric registry's map of
entities. If a new replica of the same tablet is added to the server,
it will create a new entity that will be registered with the
metric_registry, either as a new insertion or overwriting the old
replica's entity if the entity had been unpublished but not yet
unregistered.

The tombstoned tablet's metric entity is not destroyed when it's
deregistered, since there are still refs to it and its metrics in the
TabletReplica class, Tablet class, and many other classes associated
with a TabletReplica. The entity will be destroyed when the
TabletReplica is deleted (since it either contains or holds final
references to all the other classes), which happens if the tablet is
replaced or deleted. While it's not ideal to hold the memory for the
entity when it's no longer used, the reason this occurs is because the
TabletReplica instance mostly stays alive. Releasing all the metric
references would require specifically dropping refs to those metrics in
all the surviving subcomponents of a TabletReplica instance that has
been shut down; I think this problem would be better solved by more
completely cleaning up a shut down TabletReplica instance, but that's a
much larger scope than suppressing the metrics.

Change-Id: I21b52262203410ded1e514502b59a79be12f8fb3
Reviewed-on: http://gerrit.cloudera.org:8080/7981
Tested-by: Kudu Jenkins
Reviewed-by: David Ribeiro Alves <davidralves@gmail.com>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/8c23c97b
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/8c23c97b
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/8c23c97b

Branch: refs/heads/master
Commit: 8c23c97b82708b96e163e75e0002ce96adf8d8fc
Parents: 60ea4a6
Author: Will Berkeley <wdberkeley@apache.org>
Authored: Wed Sep 6 10:57:39 2017 -0700
Committer: Will Berkeley <wdberkeley@gmail.com>
Committed: Tue Oct 17 14:54:51 2017 +0000

----------------------------------------------------------------------
 src/kudu/integration-tests/tablet_copy-itest.cc | 72 ++++++++++++++++++--
 src/kudu/tablet/tablet.cc                       |  3 +
 src/kudu/tserver/tablet_server-test.cc          | 47 +++++++++++++
 src/kudu/util/metrics.cc                        | 18 +++--
 src/kudu/util/metrics.h                         | 17 ++++-
 5 files changed, 146 insertions(+), 11 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/8c23c97b/src/kudu/integration-tests/tablet_copy-itest.cc
----------------------------------------------------------------------
diff --git a/src/kudu/integration-tests/tablet_copy-itest.cc b/src/kudu/integration-tests/tablet_copy-itest.cc
index 88a9350..a36e65e 100644
--- a/src/kudu/integration-tests/tablet_copy-itest.cc
+++ b/src/kudu/integration-tests/tablet_copy-itest.cc
@@ -124,10 +124,12 @@ using std::unordered_map;
 using std::vector;
 using strings::Substitute;
 
+METRIC_DECLARE_entity(tablet);
 METRIC_DECLARE_histogram(handler_latency_kudu_consensus_ConsensusService_UpdateConsensus);
 METRIC_DECLARE_counter(glog_info_messages);
 METRIC_DECLARE_counter(glog_warning_messages);
 METRIC_DECLARE_counter(glog_error_messages);
+METRIC_DECLARE_counter(rows_inserted);
 METRIC_DECLARE_counter(tablet_copy_bytes_fetched);
 METRIC_DECLARE_counter(tablet_copy_bytes_sent);
 METRIC_DECLARE_gauge_int32(tablet_copy_open_client_sessions);
@@ -1624,6 +1626,17 @@ int64_t CountBootstrappingTablets(ExternalTabletServer *ets) {
       &ret));
   return ret;
 }
+
+int64_t CountRowsInserted(ExternalTabletServer *ets) {
+  int64_t ret;
+  CHECK_OK(ets->GetInt64Metric(
+      &METRIC_ENTITY_tablet,
+      nullptr,
+      &METRIC_rows_inserted,
+      "value",
+      &ret));
+  return ret;
+}
 } // anonymous namespace
 
 // Check that state metrics work during tablet copy.
@@ -1676,10 +1689,9 @@ TEST_F(TabletCopyITest, TestTabletStateMetricsDuringTabletCopy) {
   });
 
   // Wait for the tablet to be revived via tablet copy.
-  ASSERT_OK(inspect_->WaitForTabletDataStateOnTS(
-      follower_index, tablet_id,
-      { tablet::TABLET_DATA_READY },
-      kTimeout));
+  ASSERT_OK(inspect_->WaitForTabletDataStateOnTS(follower_index, tablet_id,
+                                                 { tablet::TABLET_DATA_READY },
+                                                 kTimeout));
 
   // State: 1 running or bootstrapping tablet.
   ASSERT_EVENTUALLY([&] {
@@ -1688,6 +1700,58 @@ TEST_F(TabletCopyITest, TestTabletStateMetricsDuringTabletCopy) {
   });
 }
 
+TEST_F(TabletCopyITest, TestMetricsResetAfterRevival) {
+  MonoDelta kTimeout = MonoDelta::FromSeconds(30);
+  const int kTsIndex = 1; // Pick a random TS.
+  NO_FATALS(StartCluster());
+
+  // Populate the tablet with some data.
+  TestWorkload workload(cluster_.get());
+  workload.Setup();
+  workload.Start();
+  while (workload.rows_inserted() < 1000) {
+    SleepFor(MonoDelta::FromMilliseconds(10));
+  }
+  workload.StopAndJoin();
+
+  TServerDetails* ts = ts_map_[cluster_->tablet_server(kTsIndex)->uuid()];
+  vector<ListTabletsResponsePB::StatusAndSchemaPB> tablets;
+  ASSERT_OK(WaitForNumTabletsOnTS(ts, 1, kTimeout, &tablets));
+  string tablet_id = tablets[0].tablet_status().tablet_id();
+
+  // Ensure all the servers agree before we proceed.
+  ASSERT_OK(WaitForServersToAgree(kTimeout, ts_map_, tablet_id,
+                                  workload.batches_completed()));
+
+  // Find the leader so we can tombstone a follower, which makes the test faster and more
focused.
+  int leader_index;
+  int follower_index;
+  ASSERT_EVENTUALLY([&]() {
+    TServerDetails* leader_ts;
+    TServerDetails* follower_ts;
+    ASSERT_OK(FindTabletLeader(ts_map_, tablet_id, kTimeout, &leader_ts));
+    leader_index = cluster_->tablet_server_index_by_uuid(leader_ts->uuid());
+    follower_index = (leader_index + 1) % cluster_->num_tablet_servers();
+    follower_ts = ts_map_[cluster_->tablet_server(follower_index)->uuid()];
+
+    // Make sure the metrics reflect that some rows have been inserted.
+    ASSERT_GT(CountRowsInserted(cluster_->tablet_server(follower_index)), 0);
+
+    // Tombstone the tablet on the follower.
+    ASSERT_OK(itest::DeleteTablet(follower_ts, tablet_id,
+                                  TABLET_DATA_TOMBSTONED,
+                                  boost::none, kTimeout));
+  });
+
+  // Wait for the tablet to be revived via tablet copy.
+  ASSERT_OK(inspect_->WaitForTabletDataStateOnTS(follower_index, tablet_id,
+                                                 { tablet::TABLET_DATA_READY },
+                                                 kTimeout));
+
+  // The rows inserted should have reset back to 0.
+  ASSERT_EQ(0, CountRowsInserted(cluster_->tablet_server(follower_index)));
+}
+
 // Test that tablet copy session initialization can handle concurrency when
 // there is latency during session initialization. This is a regression test
 // for KUDU-2124.

http://git-wip-us.apache.org/repos/asf/kudu/blob/8c23c97b/src/kudu/tablet/tablet.cc
----------------------------------------------------------------------
diff --git a/src/kudu/tablet/tablet.cc b/src/kudu/tablet/tablet.cc
index 0b16f07..e7fa042 100644
--- a/src/kudu/tablet/tablet.cc
+++ b/src/kudu/tablet/tablet.cc
@@ -299,6 +299,9 @@ void Tablet::Shutdown() {
   std::lock_guard<rw_spinlock> lock(component_lock_);
   components_ = nullptr;
   state_ = kShutdown;
+  if (metric_entity_) {
+    metric_entity_->Unpublish();
+  }
 
   // In the case of deleting a tablet, we still keep the metadata around after
   // ShutDown(), and need to flush the metadata to indicate that the tablet is deleted.

http://git-wip-us.apache.org/repos/asf/kudu/blob/8c23c97b/src/kudu/tserver/tablet_server-test.cc
----------------------------------------------------------------------
diff --git a/src/kudu/tserver/tablet_server-test.cc b/src/kudu/tserver/tablet_server-test.cc
index 3406d7a..8016d47 100644
--- a/src/kudu/tserver/tablet_server-test.cc
+++ b/src/kudu/tserver/tablet_server-test.cc
@@ -2609,5 +2609,52 @@ TEST_F(TabletServerTest, TestDataDirGroupsCreated) {
   ASSERT_TRUE(md.Compare(orig_group, new_group));
 }
 
+TEST_F(TabletServerTest, TestNoMetricsForTombstonedTablet) {
+  // Force the metrics to be retired immediately.
+  FLAGS_metrics_retirement_age_ms = 0;
+
+  scoped_refptr<TabletReplica> tablet;
+  ASSERT_TRUE(mini_server_->server()->tablet_manager()->LookupTablet(kTabletId,
&tablet));
+
+  // Insert one row and check the insertion is recorded in the metrics.
+  ASSERT_NO_FATAL_FAILURE(InsertTestRowsRemote(0, 1, 1));
+  scoped_refptr<Counter> rows_inserted =
+      METRIC_rows_inserted.Instantiate(tablet->tablet()->GetMetricEntity());
+  int64_t num_rows_running = rows_inserted->value();
+  ASSERT_EQ(1, num_rows_running);
+
+  // Tombstone the tablet.
+  DeleteTabletRequestPB req;
+  DeleteTabletResponsePB resp;
+  RpcController rpc;
+  req.set_dest_uuid(mini_server_->server()->fs_manager()->uuid());
+  req.set_tablet_id(kTabletId);
+  req.set_delete_type(tablet::TABLET_DATA_TOMBSTONED);
+  {
+    SCOPED_TRACE(SecureDebugString(req));
+    ASSERT_OK(admin_proxy_->DeleteTablet(req, &resp, &rpc));
+    SCOPED_TRACE(SecureDebugString(resp));
+    ASSERT_FALSE(resp.has_error());
+  }
+
+  // It takes three calls to /jsonmetricz for the tablet metrics to go away, based on the
+  // policy in MetricRegistry::RetireOldMetrics:
+  // 1. The entity's metrics are returned, but also marked for retirement.
+  // 2. The entity's metrics are returned, but also retired (causing the entity to be retired).
+  // 3. The metrics aren't returned-- the entity has been removed from the metrics registry.
+  EasyCurl c;
+  faststring buf;
+  for (int i = 0; i < 3; i++) {
+    ASSERT_OK(c.FetchURL(strings::Substitute("http://$0/jsonmetricz",
+                                             mini_server_->bound_http_addr().ToString()),
+                         &buf));
+    if (i < 2) {
+      ASSERT_STR_CONTAINS(buf.ToString(), "\"type\": \"tablet\"");
+    } else {
+      ASSERT_STR_NOT_CONTAINS(buf.ToString(), "\"type\": \"tablet\"");
+    }
+  }
+}
+
 } // namespace tserver
 } // namespace kudu

http://git-wip-us.apache.org/repos/asf/kudu/blob/8c23c97b/src/kudu/util/metrics.cc
----------------------------------------------------------------------
diff --git a/src/kudu/util/metrics.cc b/src/kudu/util/metrics.cc
index a053d55..323ea80 100644
--- a/src/kudu/util/metrics.cc
+++ b/src/kudu/util/metrics.cc
@@ -164,7 +164,8 @@ MetricEntity::MetricEntity(const MetricEntityPrototype* prototype,
                            std::string id, AttributeMap attributes)
     : prototype_(prototype),
       id_(std::move(id)),
-      attributes_(std::move(attributes)) {}
+      attributes_(std::move(attributes)),
+      published_(true) {}
 
 MetricEntity::~MetricEntity() {
 }
@@ -264,7 +265,7 @@ void MetricEntity::RetireOldMetrics() {
   for (auto it = metric_map_.begin(); it != metric_map_.end();) {
     const scoped_refptr<Metric>& metric = it->second;
 
-    if (PREDICT_TRUE(!metric->HasOneRef())) {
+    if (PREDICT_TRUE(!metric->HasOneRef() && published_)) {
       // The metric is still in use. Note that, in the case of "NeverRetire()", the metric
       // will have a ref-count of 2 because it is reffed by the 'never_retire_metrics_'
       // collection.
@@ -278,8 +279,8 @@ void MetricEntity::RetireOldMetrics() {
     }
 
     if (!metric->retire_time_.Initialized()) {
-      VLOG(3) << "Metric " << it->first << " has become un-referenced.
Will retire after "
-              << "the retention interval";
+      VLOG(3) << "Metric " << it->first << " has become un-referenced
or unpublished. "
+              << "Will retire after the retention interval";
       // This is the first time we've seen this metric as retirable.
       metric->retire_time_ =
           now + MonoDelta::FromMilliseconds(FLAGS_metrics_retirement_age_ms);
@@ -358,8 +359,10 @@ void MetricRegistry::RetireOldMetrics() {
   for (auto it = entities_.begin(); it != entities_.end();) {
     it->second->RetireOldMetrics();
 
-    if (it->second->num_metrics() == 0 && it->second->HasOneRef()) {
-      // No metrics and no external references to this entity, so we can retire it.
+    if (it->second->num_metrics() == 0 &&
+        (it->second->HasOneRef() || !it->second->published())) {
+      // This entity has no metrics and either has no more external references or has
+      // been marked as unpublished, so we can remove it.
       // Unlike retiring the metrics themselves, we don't wait for any timeout
       // to retire them -- we assume that that timed retention has been satisfied
       // by holding onto the metrics inside the entity.
@@ -476,6 +479,9 @@ scoped_refptr<MetricEntity> MetricRegistry::FindOrCreateEntity(
   if (!e) {
     e = new MetricEntity(prototype, id, initial_attributes);
     InsertOrDie(&entities_, id, e);
+  } else if (!e->published()) {
+    e = new MetricEntity(prototype, id, initial_attributes);
+    entities_[id] = e;
   } else {
     e->SetAttributes(initial_attributes);
   }

http://git-wip-us.apache.org/repos/asf/kudu/blob/8c23c97b/src/kudu/util/metrics.h
----------------------------------------------------------------------
diff --git a/src/kudu/util/metrics.h b/src/kudu/util/metrics.h
index 7a11d43..fe0c635 100644
--- a/src/kudu/util/metrics.h
+++ b/src/kudu/util/metrics.h
@@ -501,6 +501,18 @@ class MetricEntity : public RefCountedThreadSafe<MetricEntity>
{
     return metric_map_.size();
   }
 
+  // Mark this entity as unpublished. This will cause the registry to retire its metrics
+  // and unregister it.
+  void Unpublish() {
+    std::lock_guard<simple_spinlock> l(lock_);
+    published_ = false;
+  }
+
+  bool published() {
+    std::lock_guard<simple_spinlock> l(lock_);
+    return published_;
+  }
+
  private:
   friend class MetricRegistry;
   friend class RefCountedThreadSafe<MetricEntity>;
@@ -522,11 +534,14 @@ class MetricEntity : public RefCountedThreadSafe<MetricEntity>
{
   // Map from metric name to Metric object. Protected by lock_.
   MetricMap metric_map_;
 
-  // The key/value attributes. Protected by lock_
+  // The key/value attributes. Protected by lock_.
   AttributeMap attributes_;
 
   // The set of metrics which should never be retired. Protected by lock_.
   std::vector<scoped_refptr<Metric> > never_retire_metrics_;
+
+  // Whether this entity is published. Protected by lock_.
+  bool published_;
 };
 
 // Base class to allow for putting all metrics into a single container.


Mime
View raw message