hadoop-yarn-dev mailing list archives

From "Qihong Wu (Jira)" <j...@apache.org>
Subject [jira] [Created] (YARN-10851) Tez session close does not interrupt yarn's async thread
Date Wed, 07 Jul 2021 22:25:00 GMT
Qihong Wu created YARN-10851:
--------------------------------

             Summary: Tez session close does not interrupt yarn's async thread
                 Key: YARN-10851
                 URL: https://issues.apache.org/jira/browse/YARN-10851
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 2.10.1, 2.8.5
         Environment: HA cluster where RM1 is not the active RM;
YARN 2.8.5, configured with Tez
            Reporter: Qihong Wu
         Attachments: hive.log

Hi, I would like expert input on how YARN behaves when handling `InterruptedIOException`.

The issue occurs on an HA cluster where RM1 is NOT the active RM. If a YARN request
made to RM1 fails, the client is expected to fail over to the other RM. However, if an
interrupted exception is thrown while connecting to RM1, the thread should [bail out|https://dzone.com/articles/how-to-handle-the-interruptedexception]
as soon as possible to [respect the interrupt request|https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html#shutdownNow--],
rather than moving on to another RM.

But I found that my application (Hive), after an `InterruptedIOException` is thrown while
connecting to RM1, continues on to RM2. How does YARN handle `InterruptedIOException`?
Shouldn't the async thread be interrupted and shut down when Tez's close() triggers the
interrupt request?
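
To make the expected behavior concrete, here is a minimal sketch of a failover loop that stops on an interrupt instead of trying the next RM. It is not the Hadoop `RetryInvocationHandler` code; `InterruptAwareFailover`, `RmCall`, `callWithFailover`, and `attemptsWhenFailingWith` are hypothetical names used only for illustration. One detail that matters here: `java.net.SocketTimeoutException` is a subclass of `InterruptedIOException`, so the sketch assumes a plain socket timeout should still fail over and only other `InterruptedIOException`s should abort.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the Hadoop RetryInvocationHandler implementation.
class InterruptAwareFailover {

    /** Hypothetical stand-in for one RPC against a given ResourceManager. */
    interface RmCall<T> {
        T invoke(String rmId) throws IOException;
    }

    /** Tries each RM in order, but bails out as soon as an interrupt is seen. */
    static <T> T callWithFailover(List<String> rmIds, RmCall<T> call) throws IOException {
        IOException last = null;
        for (String rmId : rmIds) {
            // Do not start a new attempt if the thread was already interrupted.
            if (Thread.currentThread().isInterrupted()) {
                throw new InterruptedIOException("interrupted before contacting " + rmId);
            }
            try {
                return call.invoke(rmId);
            } catch (SocketTimeoutException e) {
                // Subclass of InterruptedIOException, but assumed here to be an
                // ordinary timeout: remember it and fail over to the next RM.
                last = e;
            } catch (InterruptedIOException e) {
                // Real interrupt: restore the flag and stop without failing over.
                Thread.currentThread().interrupt();
                throw e;
            } catch (IOException e) {
                // Ordinary connection failure: fail over to the next RM.
                last = e;
            }
        }
        throw (last != null) ? last : new IOException("no ResourceManager ids given");
    }

    /** Demo helper: returns which RMs were attempted when every call fails with {@code failure}. */
    static List<String> attemptsWhenFailingWith(IOException failure) {
        List<String> attempted = new ArrayList<>();
        try {
            callWithFailover(Arrays.asList("rm1", "rm2"), rmId -> {
                attempted.add(rmId);
                throw failure;
            });
        } catch (IOException expected) {
            // Every attempt fails in this demo, so an exception is expected.
        }
        Thread.interrupted(); // clear the flag so the demo leaves the calling thread clean
        return attempted;
    }

    public static void main(String[] args) {
        System.out.println(attemptsWhenFailingWith(new InterruptedIOException("interrupted")));
        System.out.println(attemptsWhenFailingWith(new SocketTimeoutException("timed out")));
    }
}
```

With an `InterruptedIOException` only rm1 is attempted, while a `SocketTimeoutException` still reaches rm2. Whether YARN should draw the line in the same place is exactly the question raised in this issue.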



*Reproduction steps:*
 1. Use an HA cluster that runs YARN 2.8.5 and is configured with Tez.
 2. Make sure RM1 is not the active RM by checking `yarn rmadmin -getAllServiceState`. If
it is, manually [transition RM2 to be the active RM|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html#Admin_commands].
 3. Apply the following failover-retry properties to yarn-site.xml:
{quote}<property>
 <name>yarn.client.failover-retries</name>
 <value>4</value>
 </property>
 <property>
 <name>yarn.client.failover-retries-on-socket-timeouts</name>
 <value>4</value>
 </property>
 <property>
 <name>yarn.client.failover-max-attempts</name>
 <value>4</value>
 </property>
{quote}
 4. Run a simple application through the YARN client (for example, a simple Hive DDL command):
{quote}hive --hiveconf hive.root.logger=TRACE,console -e "create table tez_test (id int, name string);"
{quote}
 5. In the application's log (for example, hive.log), you can see that `RetryInvocationHandler`
caught the `InterruptedIOException` while the request was talking to rm1, yet the thread
did not bail out immediately and instead continued on to rm2.



*More information:*
The interrupt is triggered via [TezSessionState#close|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java#L689]
and [Future#cancel|https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/Future.html#cancel-boolean-].
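
For readers less familiar with this mechanism, the sketch below shows how `Future#cancel(true)` delivers an interrupt to the thread running the task. It uses only `java.util.concurrent` and is not the Hive or Tez code; the class name `CancelInterruptSketch`, the method `workerSawInterrupt`, and the sleeping task that stands in for a blocked RM connection attempt are assumptions made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; not the TezSessionState implementation.
class CancelInterruptSketch {

    /** Returns true if the worker thread observed the interrupt sent by cancel(true). */
    static boolean workerSawInterrupt() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(1);
        AtomicBoolean sawInterrupt = new AtomicBoolean(false);

        Future<?> pending = pool.submit(() -> {
            started.countDown();
            try {
                // Stand-in for a blocking call such as connecting to an RM.
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                // A cooperative task stops here instead of retrying elsewhere.
                sawInterrupt.set(true);
            } finally {
                finished.countDown();
            }
        });

        try {
            started.await();                     // wait until the task is running
            pending.cancel(true);                // mayInterruptIfRunning = true
            finished.await(5, TimeUnit.SECONDS); // give the worker time to react
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return sawInterrupt.get();
    }

    public static void main(String[] args) {
        System.out.println("worker saw interrupt: " + workerSawInterrupt());
    }
}
```

The interrupt only sets a flag or surfaces as an exception in the worker; the task stops promptly only if the code running in it, including any retry or failover layer, chooses to honor it.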



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


