hadoop-common-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-7472) RPC client should deal with the IP address changes
Date Wed, 03 Aug 2011 22:05:27 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-7472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079066#comment-13079066 ]

Kihwal Lee commented on HADOOP-7472:
------------------------------------

h4. How this Jira depends on MAPREDUCE-2764


For the patch in this Jira to work, the upper layer (e.g. DFSClient) must use the host name
or CNAME (alias) that is intended to locate the service when instantiating the InetSocketAddress
of the RPC server (e.g. the namenode). Otherwise, an IP address change may go undetected.
Unfortunately, this is not always possible, because certain components create the InetSocketAddress
from an IP address string only. A reverse lookup on such an address may not return the original
host name or alias.
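To make this dependency concrete, here is a small Java sketch (an illustration, not code from the attached patches) showing how the string used to build an InetSocketAddress determines what can be re-resolved later. The host name namenode.example.com, the IP 10.0.0.1 and port 8020 are made-up values, pinnedToIp() is a hypothetical heuristic, and getHostString() requires Java 7, which is newer than what the 0.20 line targets. getHostName() is deliberately avoided because it may trigger the very reverse lookup described above.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;

public class AddressOrigin {

    /**
     * Returns the string the address was created from without triggering
     * a reverse DNS lookup (getHostName() may perform one). Java 7+.
     */
    static String originalHostString(InetSocketAddress addr) {
        return addr.getHostString();
    }

    /**
     * Hypothetical heuristic: true if the address appears to have been built
     * from an IP literal, i.e. the retained host string is just the textual IP.
     * Re-resolving such a string can never yield a different IP.
     */
    static boolean pinnedToIp(InetSocketAddress addr) {
        InetAddress ip = addr.getAddress();
        return ip != null && addr.getHostString().equals(ip.getHostAddress());
    }

    public static void main(String[] args) {
        // Built from a host name; createUnresolved() performs no lookup at all.
        // The name is kept, so a later re-resolution can see a new DNS mapping.
        InetSocketAddress byName =
            InetSocketAddress.createUnresolved("namenode.example.com", 8020);

        // Built from an IP literal; no DNS lookup happens, and the only string
        // retained is the literal itself.
        InetSocketAddress byIp = new InetSocketAddress("10.0.0.1", 8020);

        System.out.println(originalHostString(byName)); // namenode.example.com
        System.out.println(originalHostString(byIp));   // 10.0.0.1
        System.out.println(pinnedToIp(byName));         // false
        System.out.println(pinnedToIp(byIp));           // true
    }
}
```

Only the first address carries information that a later re-resolution can use; the second is pinned to 10.0.0.1 regardless of what DNS says afterwards.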

An early attempt was made to ensure the correct behavior for DFSClient, but it quickly became
clear that tokens would not work, because token renewal uses the cached IP address.
For this reason, MAPREDUCE-2764 must address the same problem (caching and/or exclusive
use of the IP address) in a broader scope.

In more detail, MAPREDUCE-2764 affects this Jira in the following ways:
* Provide a way to keep the original host name string the user supplied and let the RPC Client
access it. This is critical for reliably detecting IP address changes.
* Make the token cache and token renewal work regardless of IP address changes. This lets
this Jira support a broader range of clients/apps, especially long-running ones.
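The two MAPREDUCE-2764 items above can be sketched together. In the Java sketch below, ServerAddress is a hypothetical class (not part of Hadoop): it keeps the user's host name string, re-resolves it on demand through an injected resolver function that stands in for DNS, and derives the token cache key from the name rather than from the IP. The host name, port and token string are placeholders, and fakeAddress() exists only so the example runs offline.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class ServerAddress {
    private final String originalHost; // exactly what the user supplied
    private final int port;

    ServerAddress(String originalHost, int port) {
        this.originalHost = originalHost;
        this.port = port;
    }

    /** Exposes the original host string to the RPC client. */
    String originalHost() {
        return originalHost;
    }

    /** Re-resolves the original name on every call instead of trusting a cached IP. */
    InetSocketAddress resolve(Function<String, InetAddress> resolver) {
        InetAddress addr = resolver.apply(originalHost);
        if (addr == null) {
            throw new IllegalStateException("cannot resolve " + originalHost);
        }
        return new InetSocketAddress(addr, port);
    }

    /** Token service key derived from the name, never from the resolved IP. */
    String tokenServiceKey() {
        return originalHost + ":" + port;
    }

    /** Builds an InetAddress for host -> 10.0.0.lastOctet without any DNS lookup. */
    static InetAddress fakeAddress(String host, int lastOctet) {
        try {
            return InetAddress.getByAddress(host, new byte[] {10, 0, 0, (byte) lastOctet});
        } catch (UnknownHostException e) {
            // Only thrown for an address array of illegal length.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, InetAddress> dns = new HashMap<>(); // simulated DNS table
        dns.put("nn.example.com", fakeAddress("nn.example.com", 1));

        ServerAddress nn = new ServerAddress("nn.example.com", 8020);
        Map<String, String> tokenCache = new HashMap<>();
        tokenCache.put(nn.tokenServiceKey(), "token-A");

        System.out.println(nn.resolve(dns::get).getAddress().getHostAddress()); // 10.0.0.1

        // The namenode migrates and DNS is updated to a new IP.
        dns.put("nn.example.com", fakeAddress("nn.example.com", 2));
        System.out.println(nn.resolve(dns::get).getAddress().getHostAddress()); // 10.0.0.2

        // The token is still found because its key never contained the IP.
        System.out.println(tokenCache.get(nn.tokenServiceKey())); // token-A
    }
}
```

Keying the cache by the name keeps renewals working across the move, while resolve() gives the RPC client a fresh address to compare against whatever it cached.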

This Jira (IP address change detection) aims for somewhat better error recovery
by avoiding restarts of some components. The initial target is very narrow, but future work
may broaden the scope.

> RPC client should deal with the IP address changes
> --------------------------------------------------
>
>                 Key: HADOOP-7472
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7472
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: ipc
>    Affects Versions: 0.20.205.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Minor
>             Fix For: 0.20.205.0
>
>         Attachments: addr_change_dfs-1.patch.txt, addr_change_dfs-2.patch.txt, addr_change_dfs-3.patch.txt,
>                      addr_change_dfs.patch.txt, addr_change_dfs_0_20s-1.patch.txt, addr_change_dfs_0_20s.patch.txt,
>                      addr_change_dfs_trunk-1.patch.txt, addr_change_dfs_trunk-2.patch.txt, addr_change_dfs_trunk.patch.txt
>
>
> The current RPC client implementation and the client-side callers assume that the hostname-address
> mappings of servers never change. The resolved address is stored in an immutable InetSocketAddress
> object above/outside RPC, and the reconnect logic in the RPC Connection implementation also
> trusts the resolved address that was passed down.
> If the NN suffers a failure that requires migration, it may be started on a different
> node with a different IP address. In this case, even if the name-address mapping is updated
> in DNS, the cluster remains stuck trying the old address until the whole cluster is restarted.
> The RPC client-side should detect this situation and exit or try to recover.
> Updating the ConnectionId within the Client implementation may get the system working for the
> moment, but there is always a risk that the cached address:port unintentionally becomes connectable
> again. The real solution is to notify the upper layer of the address change so that it can re-resolve
> and retry, or to re-architect the system as discussed in HDFS-34.
> For the 0.20 line, some type of compromise may be acceptable. For example, raise a custom
> exception so that some well-defined, high-impact upper layers can re-resolve/retry, while others
> will have to restart. For TRUNK, the HA work will most likely determine what needs to be
> done, so this Jira won't cover the solutions for TRUNK.
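The 0.20 compromise described in the issue can be sketched roughly as follows, assuming the RPC layer still has the original host name available (which is what MAPREDUCE-2764 is meant to provide). AddressChangedException, checkAddress() and addressToUse() are hypothetical names, not Hadoop APIs, and the resolver function stands in for a real DNS lookup so the example runs offline.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.util.function.Function;

public class AddressChangeDetector {

    /** Hypothetical custom exception telling the upper layer that the server moved. */
    static class AddressChangedException extends IOException {
        final InetSocketAddress newAddress;

        AddressChangedException(InetSocketAddress oldAddress, InetSocketAddress newAddress) {
            super("Server address changed from " + oldAddress + " to " + newAddress);
            this.newAddress = newAddress;
        }
    }

    /**
     * RPC-layer check before a reconnect attempt: re-resolve the original host
     * name and compare it with the cached IP. If resolution fails (null), the
     * cached address is left in place.
     */
    static void checkAddress(String originalHost, InetSocketAddress cached,
                             Function<String, InetAddress> resolver)
            throws AddressChangedException {
        InetAddress current = resolver.apply(originalHost);
        if (current != null && !current.equals(cached.getAddress())) {
            throw new AddressChangedException(
                cached, new InetSocketAddress(current, cached.getPort()));
        }
    }

    /**
     * Upper-layer handling for a component that knows how to recover:
     * switch to the re-resolved address instead of restarting.
     */
    static InetSocketAddress addressToUse(String originalHost, InetSocketAddress cached,
                                          Function<String, InetAddress> resolver) {
        try {
            checkAddress(originalHost, cached, resolver);
            return cached;         // mapping unchanged, keep using the cached address
        } catch (AddressChangedException e) {
            return e.newAddress;   // a real caller would rebuild its RPC proxy here
        }
    }

    public static void main(String[] args) {
        // IP literals: these constructors do not perform a DNS lookup.
        InetSocketAddress cached = new InetSocketAddress("10.0.0.1", 8020);
        InetAddress oldIp = cached.getAddress();
        InetAddress newIp = new InetSocketAddress("10.0.0.2", 8020).getAddress();

        // DNS still maps the name to the old IP: nothing changes.
        System.out.println(addressToUse("nn.example.com", cached, host -> oldIp)
            .getAddress().getHostAddress()); // 10.0.0.1
        // DNS now maps the name to the new IP: the change is detected.
        System.out.println(addressToUse("nn.example.com", cached, host -> newIp)
            .getAddress().getHostAddress()); // 10.0.0.2
    }
}
```

A component that cannot handle the exception would simply let it propagate and be restarted, which corresponds to the "others will have to restart" case above.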

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
