hadoop-common-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11671) Asynchronous native RPC v9 client
Date Mon, 09 Mar 2015 21:52:39 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353699#comment-14353699
] 

Colin Patrick McCabe commented on HADOOP-11671:
-----------------------------------------------

Hi [~wheat9],

I agree that this is an interesting project.  However, I feel like the current description
of this JIRA is misleading because it implies that the alternative to an asynchronous native
RPC client is JNI.  In fact, the alternative is something like the synchronous native RPC
client which now exists in the HDFS-6994 branch.

I'd also like to repeat the comment I made on HDFS-6994 (https://issues.apache.org/jira/browse/HDFS-6994?focusedCommentId=14329333&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14329333):
bq. If you look at a high-performance HDFS client like Impala or HAWQ, they are fine with
synchronous APIs. Why? Well, most of the time your read performance is limited by the bandwidth
of the local disks (high performance clients always try to do local reads, and use short-circuit
and mmap if possible). A local hard disk can't handle more than maybe 100 seeks a second,
and the more seeks you do, the lower your bandwidth will be.  There is also the CPU aspect:
what are you doing with the data? Sure you can have 10,000 async requests going with 1 thread,
but if that thread is actually doing anything with the data, you can cut a few zeroes off
of that. And then you're back to an amount of concurrent reads that can be comfortably done
synchronously. Async APIs work best for cases where you are doing very, very little processing
on each request. So an async web server like nginx, which is written in highly optimized straight
C (no ++) can squeeze a few more pages per second out of reducing its thread count. But in
a DB it's tougher (and as you mentioned, it also makes the code much more complex). So while
we should probably consider an async client at some point, I think it is much lower priority
than other things (like finishing the existing native client and merging it).
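
As a rough back-of-the-envelope sketch of the argument quoted above: the only figure taken from the comment is the ceiling of about 100 seeks per second per disk. The disk count, the per-seek latency, and the per-request CPU cost used in the example calls are illustrative assumptions, not measurements, and the function names are made up for this sketch.

```python
def disk_bound_in_flight(num_disks: int,
                         seeks_per_sec_per_disk: int,
                         seek_latency_ms: int) -> int:
    """Little's law: requests in flight = throughput * latency.

    Returns how many concurrent reads are enough to keep the disks busy,
    rounded up to a whole request.
    """
    total_seeks_per_sec = num_disks * seeks_per_sec_per_disk
    # Integer ceiling of (seeks/sec * latency in ms) / 1000 ms per second.
    return -(-(total_seeks_per_sec * seek_latency_ms) // 1000)


def cpu_bound_requests_per_sec(per_request_cpu_us: int) -> int:
    """Upper bound on requests one thread can process per second
    when each request costs per_request_cpu_us microseconds of CPU."""
    return 1_000_000 // per_request_cpu_us


if __name__ == "__main__":
    # Assumed node: 12 local disks, ~100 seeks/s each, ~10 ms per seek.
    print(disk_bound_in_flight(12, 100, 10))
    # Assumed workload: 1 ms of CPU work per request on a single thread.
    print(cpu_bound_requests_per_sec(1000))
```

Under these assumptions, roughly a dozen concurrent reads saturate the disks, and a single thread that spends a millisecond per request cannot usefully service anything close to 10,000 outstanding requests. Both numbers sit in the range a modest pool of synchronous threads can cover.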

I guess you are probably already aware of this (since you linked the JIRAs), but it's worth
mentioning that the HADOOP-10388 branch already implemented a native asynchronous RPC client.
 The only dependency was libuv.  You can check it out at http://svn.apache.org/repos/asf/hadoop/common/branches/HADOOP-10388/hadoop-native-core/src/main/native/rpc/

I don't want to be a kill-joy, but we now have 3 branches with native client RPC implementations,
and 0 usable native clients.  I hope that we can integrate these efforts with the HDFS-6994
effort and deliver something that works.

> Asynchronous native RPC v9 client
> ---------------------------------
>
>                 Key: HADOOP-11671
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11671
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Haohui Mai
>            Assignee: Haohui Mai
>              Labels: native, rpc
>
> There is more and more integration happening between Hadoop and applications that are
implemented using languages other than Java.
> To access Hadoop, applications either have to go through JNI (e.g. libhdfs), or reverse
engineer the Hadoop RPC protocol (e.g. snakebite). Unfortunately, neither of them is satisfactory:
> * Integrating with JNI requires running a JVM inside the application. Some applications
(e.g., real-time processing, MPP databases) do not want the footprint and GC behavior of
the JVM.
> * The Hadoop RPC protocol has a rich feature set including wire encryption, SASL, Kerberos
authentication. Many 3rd-party implementations cannot fully cover the feature set, thus they
might work only in limited environments.
> This jira proposes implementing a Hadoop RPC library in C++ that provides a common
ground to implement higher-level native clients for HDFS, YARN, and MapReduce.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
