cassandra-pr mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From GitBox <...@apache.org>
Subject [GitHub] [cassandra] dcapwell commented on a change in pull request #446: CASSANDRA-15564 Refactor repair coordinator to centralize stage change logic and improved the public facing errors
Date Tue, 10 Mar 2020 17:01:20 GMT
dcapwell commented on a change in pull request #446: CASSANDRA-15564 Refactor repair coordinator
to centralize stage change logic and improved the public facing errors
URL: https://github.com/apache/cassandra/pull/446#discussion_r390468434
 
 

 ##########
 File path: src/java/org/apache/cassandra/repair/RepairRunnable.java
 ##########
 @@ -578,86 +648,26 @@ public RepairCompleteCallback(UUID parentSession,
 
         public void onSuccess(Object result)
         {
-            if (!options.isPreview())
-            {
-                SystemDistributedKeyspace.successfulParentRepair(parentSession, successfulRanges);
-            }
-            final String message;
+            maybeStoreParentRepairSuccess(successfulRanges);
             if (hasFailure.get())
             {
-                StorageMetrics.repairExceptions.inc();
-                message = "Some repair failed";
-                fireProgressEvent(new ProgressEvent(ProgressEventType.ERROR, progress.get(),
totalProgress,
-                                                    message));
+                fail(null);
 
 Review comment:
   Nope!  
   
   The issue is that errors don't get aggregated so the call to fail the job may not know
the root cause(s).  If we add an error when `fail` is called then the client will get this
last; producing a somewhat meaningless error message.
   
   Right now error messages are sent to client at the source of the error (lets say 100 RepairJobs
submitted and 2 fail; client gets 2 error events).  The current nodetool logic doesn't log
the errors but saves the last seen error as the "root cause" and prints that on complete.
 With the current example one of the repair jobs will randomly be chosen as the root cause
(could be two distinct issues...) and reported to the operator.  If we switch to fail producing
an error (which happens on trunk in some cases) the operator gets a meaningless error message
saying "something broke".
   
   This code assumes the root cause triggered an error event; it was an even larger refactor
to propagate errors up, so chose not to do that.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org


Mime
View raw message