kudu-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From t...@apache.org
Subject [3/5] kudu git commit: KUDU-2242 Wait for NTP synchronization on startup
Date Tue, 05 Jun 2018 23:29:04 GMT
KUDU-2242 Wait for NTP synchronization on startup

Sometimes it takes a little while for the clock to be synchronized when
a machine first starts up. Occasionally, this causes Kudu to fail to
start when the host starts up. This patch allows Kudu to wait for
the system clock to become synchronized when using NTP by using
either the 'ntp-wait' or the 'chronyc waitsync' utilities. The amount
of time to wait is controlled by a new flag
--ntp_initial_sync_wait_secs, which defaults to 60. In the common case
when the clock is synchronized when Kudu starts, Kudu does not wait (and
does not incur the cost of waiting on a subprocess). Waiting can be
disabled by setting the new flag to 0.

This is hard to test with a unit test, so I manually tested as follows:

1. Restart Kudu on a machine with the clock sync'd. There was no waiting
   for synchronization.
2. Using both ntpd and chrony, stop the time daemon, set the clock back, and
   start Kudu, confirming Kudu exited immediately after the wait
   utilities were unable to contact the daemons.
3. Using both ntpd and chrony, stop the time daemon and Kudu, wait for
   the clock to be a little out of sync, then start the time daemon and
   Kudu, confirming that Kudu waited until the clock was in sync and
   started successfully.

Change-Id: I6082c5d35ed0d230d91c61734e7b5a351e50933b
Reviewed-on: http://gerrit.cloudera.org:8080/10548
Reviewed-by: Adar Dembo <adar@cloudera.com>
Tested-by: Will Berkeley <wdberkeley@gmail.com>
Reviewed-by: Mike Percy <mpercy@apache.org>
Reviewed-by: Hao Hao <hao.hao@cloudera.com>
Reviewed-by: Todd Lipcon <todd@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/1c3cbb1c
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/1c3cbb1c
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/1c3cbb1c

Branch: refs/heads/master
Commit: 1c3cbb1c7d27bd659719a45c5904f1c06df65e45
Parents: 46dd4cd
Author: Will Berkeley <wdberkeley@apache.org>
Authored: Sun May 27 02:41:13 2018 -0700
Committer: Will Berkeley <wdberkeley@gmail.com>
Committed: Tue Jun 5 17:32:33 2018 +0000

----------------------------------------------------------------------
 src/kudu/clock/system_ntp.cc | 60 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 58 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/1c3cbb1c/src/kudu/clock/system_ntp.cc
----------------------------------------------------------------------
diff --git a/src/kudu/clock/system_ntp.cc b/src/kudu/clock/system_ntp.cc
index a898b31..23e6cf6 100644
--- a/src/kudu/clock/system_ntp.cc
+++ b/src/kudu/clock/system_ntp.cc
@@ -25,6 +25,7 @@
 #include <string>
 #include <vector>
 
+#include <gflags/gflags.h>
 #include <gflags/gflags_declare.h>
 #include <glog/logging.h>
 
@@ -32,6 +33,7 @@
 #include "kudu/gutil/strings/join.h"
 #include "kudu/gutil/strings/substitute.h"
 #include "kudu/util/errno.h"
+#include "kudu/util/flag_tags.h"
 #include "kudu/util/logging.h"
 #include "kudu/util/path_util.h"
 #include "kudu/util/status.h"
@@ -39,6 +41,14 @@
 
 DECLARE_bool(inject_unsync_time_errors);
 
+DEFINE_int32(ntp_initial_sync_wait_secs, 60,
+             "Amount of time in seconds to wait for NTP to synchronize the "
+             "clock at startup. A value of zero means Kudu will fail to start "
+             "if the clock is unsynchronized. This flag can prevent Kudu from "
+             "crashing if it starts before NTP can synchronize the clock.");
+TAG_FLAG(ntp_initial_sync_wait_secs, evolving);
+TAG_FLAG(ntp_initial_sync_wait_secs, advanced);
+
 using std::string;
 using std::vector;
 using strings::Substitute;
@@ -63,8 +73,9 @@ Status CallAdjTime(timex* tx) {
     case TIME_OK:
       return Status::OK();
     case -1: // generic error
-      return Status::ServiceUnavailable("Error reading clock. ntp_adjtime() failed",
-                                        ErrnoToString(errno));
+      // From 'man 2 adjtimex', ntp_adjtime failure implies an improper 'tx'.
+      return Status::InvalidArgument("Error reading clock. ntp_adjtime() failed",
+                                     ErrnoToString(errno));
     case TIME_ERROR:
       return Status::ServiceUnavailable("Error reading clock. Clock considered unsynchronized");
     default:
@@ -100,6 +111,46 @@ void TryRun(vector<string> cmd, vector<string>* log) {
 
 }
 
+Status WaitForNtp() {
+  int32_t wait_secs = FLAGS_ntp_initial_sync_wait_secs;
+  if (wait_secs <= 0) {
+    LOG(INFO) << Substitute("Not waiting for clock synchronization: "
+                            "--ntp_initial_sync_wait_secs=$0 is nonpositive",
+                            wait_secs);
+    return Status::OK();
+  }
+  LOG(INFO) << Substitute("Waiting up to --ntp_initial_sync_wait_secs=$0 "
+                          "seconds for the clock to synchronize", wait_secs);
+  vector<string> cmd;
+  string exe;
+  Status s = FindExecutable("ntp-wait", {"/sbin", "/usr/sbin"}, &exe);
+  if (s.ok()) {
+    // -s is the number of seconds to sleep between retries.
+    // -n is the number of tries before giving up.
+    cmd = {exe, "-s", "1", "-n", std::to_string(wait_secs)};
+  } else {
+    LOG(WARNING) << "Could not find ntp-wait; trying chrony waitsync instead: "
+                 << s.ToString();
+    s = FindExecutable("chronyc", {"/sbin", "/usr/sbin"}, &exe);
+    if (!s.ok()) {
+      LOG(WARNING) << "Could not find chronyc: " << s.ToString();
+      return Status::NotFound("failed to find ntp-wait or chronyc");
+    }
+    // Usage: waitsync max-tries max-correction max-skew interval.
+    // max-correction and max-skew parameters as 0 means no checks.
+    // The interval is measured in seconds.
+    cmd = {exe, "waitsync", std::to_string(wait_secs), "0", "0", "1"};
+  }
+  // Unfortunately, neither ntp-wait nor chronyc waitsync print useful messages.
+  // Instead, rely on DumpDiagnostics.
+  s = Subprocess::Call(cmd);
+  if (!s.ok()) {
+    return s.CloneAndPrepend(
+        Substitute("failed to wait for clock sync using command '$0'",
+                   JoinStrings(cmd, " ")));
+  }
+  return Status::OK();
+}
 } // anonymous namespace
 
 void SystemNtp::DumpDiagnostics(vector<string>* log) const {
@@ -131,6 +182,11 @@ void SystemNtp::DumpDiagnostics(vector<string>* log) const {
 Status SystemNtp::Init() {
   timex timex;
   Status s = CallAdjTime(&timex);
+  if (s.IsServiceUnavailable()) {
+    s = WaitForNtp().AndThen([&timex]() {
+          return CallAdjTime(&timex);
+        });
+  }
   if (!s.ok()) {
     DumpDiagnostics(/* log= */nullptr);
     return s;


Mime
View raw message