ambari-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From GitBox <...@apache.org>
Subject [GitHub] [ambari] ramkrish86 opened a new pull request #3287: AMBARI-25620 Ambari Server STOP command on a node might fail because …
Date Mon, 22 Feb 2021 17:24:38 GMT

ramkrish86 opened a new pull request #3287:
URL: https://github.com/apache/ambari/pull/3287


   …of a decommissioned NM's behaviour
   
   ## What changes were proposed in this pull request?
   
   As part of our use case, before we STOP any set of components in a node, if we see a NodeManager
and Datanode in that node, we first Decommission them and then issue STOP components to all
the components in the node.
   
   DataNodes once decommissioned are not STOPPED but they are alive as a process. Unless we
stop them. Whereas RM stops the NM on decommission. (this is a known behaviour). But what
happens is that the RecoveryManager in ambari keeps restarting the service thinking its DESIRED
state (in the agent side) is STARTED. So the restart keeps happening. So the state changes
between STARTED <-> INSTALLED on the agent side and once this happens we communicate
the Component status to the server side.
   On receiving this update the server sets the STATE as STARTED/INSTALLED as the case may
be.
   Now coming back to the actual STOP command request that we gave, as per design in ambari
server once all the component updates are sent, it processes them in batch and tries to do
the in-memory transition of STATES on the server side (not the cache but the FSM (state machine
transition). Here the event is INSTALL/STOP event for NM that the server is expecting but
instead of getting an INSTALLED state it gets STARTED state. The reason as highlighted above.
So the entire STOP command gets aborted by the server thinking there is some problem in what
it sees.
   /Multimap is analog of Map<Object, List<Object>> but allows to avoid nested
loop
           ListMultimap<String, ServiceComponentHostEvent> eventMap = formEventMap(stage,
commandsToStart);
           Map<ExecutionCommand, String> commandsToAbort = new HashMap<>();
           if (!eventMap.isEmpty()) {
             LOG.debug("==> processing {} serviceComponentHostEvents...", eventMap.size());
             Cluster cluster = clusters.getCluster(stage.getClusterName());
             if (cluster != null) {
               Map<ServiceComponentHostEvent, String> failedEvents = cluster.processServiceComponentHostEvents(eventMap);
   
               if (failedEvents.size() > 0) {
                 LOG.error("==> {} events failed.", failedEvents.size());
               }
   
               for (Iterator<ExecutionCommand> iterator = commandsToUpdate.iterator();
iterator.hasNext(); ) {
                 ExecutionCommand cmd = iterator.next();
                 for (ServiceComponentHostEvent event : failedEvents.keySet()) {
                   if (StringUtils.equals(event.getHostName(), cmd.getHostname()) &&
                     StringUtils.equals(event.getServiceComponentName(), cmd.getRole())) {
                     iterator.remove();
                     commandsToAbort.put(cmd, failedEvents.get(event));
                     break;
                   }
                 }
               }
   Check the processServiceComponentHostEvents() for the way the transition happens and what
is the Invalid Transition that happens over there. The log msg would be like this
   org.apache.ambari.server.state.fsm.InvalidStateTransitionException: Invalid event: HOST_SVCCOMP_INSTALL
at STARTED
   Since this entire set of STOP component is considered as a FAILURe, we issue ABORT command
and hence all the STOP command issued to the agent are aborted.
   This makes the DN to stay in the STARTED state itself and hence the remaining DELETE HOST
command keeps failing.
   The idea is to ensure that for NM if decommissioned and the current state is STARTED for
a HOST_SVCCOMP_INSTALL then mark it as not a failure condition so that the commands are not
aborted.
   ## How was this patch tested?
   
   (Please explain how this patch was tested. Ex: unit tests, manual tests)
   The patch was tested with a scale down feature of more than 20 nodes at one shot. 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



Mime
View raw message