hadoop-yarn-dev mailing list archives

From "tim yu (Jira)" <j...@apache.org>
Subject [jira] [Created] (YARN-10464) Flink job on YARN with HA enabled crashes all RMs on attempt recovery
Date Sun, 18 Oct 2020 16:00:00 GMT
tim yu created YARN-10464:
-----------------------------

             Summary: Flink job on YARN with HA enabled crashes all RMs on attempt recovery
                 Key: YARN-10464
                 URL: https://issues.apache.org/jira/browse/YARN-10464
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.6.0
         Environment: some properties in yarn-site.xml: 

<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>

<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>false</value>
</property>
            Reporter: tim yu


I am trying to run a Flink (1.11.1) job on our Hadoop cluster (2.6.0) with HA enabled, but when
I test it by killing the active RM, it brings down the entire cluster.
I have configured Flink's HA in flink-conf.yaml.
When I kill the active RM with kill -9, YARN correctly fails over to the standby RM,
and I can see the applications as ACCEPTED for about a minute, but then the standby RM crashes
with the following exception:
2020-10-18 15:39:36.112 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
Error in handling event type APP_ATTEMPT_ADDED to the scheduler
java.lang.NullPointerException
 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:601)
 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:698)
 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1303)
 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:123)
 at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:702)
 at java.lang.Thread.run(Thread.java:745)

I found the following code for submitting high-availability jobs in the Flink project:

  private void activateHighAvailabilitySupport(ApplicationSubmissionContext appContext)
          throws InvocationTargetException, IllegalAccessException {

      ApplicationSubmissionContextReflector reflector =
              ApplicationSubmissionContextReflector.getInstance();
      reflector.setKeepContainersAcrossApplicationAttempts(appContext, true);
      reflector.setAttemptFailuresValidityInterval(
              appContext,
              flinkConfiguration.getLong(YarnConfigOptions.APPLICATION_ATTEMPT_FAILURE_VALIDITY_INTERVAL));
  }
	
So Flink HA jobs are submitted with KeepContainersAcrossApplicationAttempts set to true.
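
The following is a minimal, self-contained sketch of one way the stack trace and the Flink setting above could be related. It is an assumption about the failure mode, not the actual Hadoop code: the classes Attempt and SchedulerApp and the methods addAttempt and crashes are hypothetical stand-ins. The idea is that after a failover without work-preserving recovery the scheduler may have no previous attempt in memory, so transferring state from it would dereference null whenever keep-containers is requested.

```java
import java.util.ArrayList;
import java.util.List;

public class KeepContainersNpeSketch {

    /** Hypothetical stand-in for a scheduler-side application attempt. */
    static class Attempt {
        final List<String> liveContainers = new ArrayList<>();

        // Copies container state from the previous attempt.
        // Throws NullPointerException if 'previous' is null.
        void transferStateFromPreviousAttempt(Attempt previous) {
            liveContainers.addAll(previous.liveContainers);
        }
    }

    /** Hypothetical stand-in for the scheduler's per-application record. */
    static class SchedulerApp {
        // Assumed to be null right after a non-work-preserving failover,
        // because no earlier attempt was re-registered with the scheduler.
        Attempt currentAttempt;
    }

    // Registers a new attempt; transfers state only when keep-containers is set.
    static Attempt addAttempt(SchedulerApp app, boolean keepContainers) {
        Attempt attempt = new Attempt();
        if (keepContainers) {
            attempt.transferStateFromPreviousAttempt(app.currentAttempt);
        }
        app.currentAttempt = attempt;
        return attempt;
    }

    // Returns true if adding the first attempt after a simulated failover
    // throws a NullPointerException.
    static boolean crashes(boolean keepContainers) {
        try {
            addAttempt(new SchedulerApp(), keepContainers);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("keepContainers=true  crashes: " + crashes(true));
        System.out.println("keepContainers=false crashes: " + crashes(false));
    }
}
```

In this sketch a null check on the previous attempt before the transfer would avoid the exception; whether that matches the real scheduler code in 2.6.0 would need to be confirmed against the Hadoop source.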



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-dev-help@hadoop.apache.org

