spark-dev mailing list archives

From Mingyu Kim <m...@palantir.com>
Subject Re: Utilizing YARN AM RPC port field
Date Wed, 15 Jun 2016 21:23:33 GMT
FYI, I just filed https://issues.apache.org/jira/browse/SPARK-15974.

 

Mingyu

 

From: Mingyu Kim <mkim@palantir.com>
Date: Tuesday, June 14, 2016 at 2:13 PM
To: Steve Loughran <stevel@hortonworks.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>, Matt Cheah <mcheah@palantir.com>
Subject: Re: Utilizing YARN AM RPC port field

 

Thanks for the pointers, Steve!

 

The first option sounds like the most light-weight and non-disruptive of the three.
So, we can add a configuration flag that enables socket initialization; when it is enabled, the
Spark AM will create a ServerSocket and set it on SparkContext.
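A minimal sketch of that flow, assuming a hypothetical config key (the flag name and variable names below are made up for illustration; the actual registration call is YARN's, not shown here):

```scala
import java.net.ServerSocket

// Hypothetical flag, e.g. read from "spark.yarn.am.rpcSocketInit" (name is illustrative).
val socketInitEnabled = true

// If enabled, the AM opens a ServerSocket on an ephemeral port (0 lets the OS pick)
// *before* registering with the RM, so a real port can be reported in
// registerApplicationMaster() instead of the dummy 0.
val amSocket: Option[ServerSocket] =
  if (socketInitEnabled) Some(new ServerSocket(0)) else None

// The port that would be registered with the RM and exposed on SparkContext.
val rpcPort: Int = amSocket.map(_.getLocalPort).getOrElse(0)
```

The socket itself (not just the port number) would be handed to the application so the RPC server can bind to the already-claimed port rather than racing for it.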

 

If there are no objections, I can file a bug and find time to tackle it myself. 

 

Mingyu

 

From: Steve Loughran <stevel@hortonworks.com>
Date: Tuesday, June 14, 2016 at 4:55 AM
To: Mingyu Kim <mkim@palantir.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>, Matt Cheah <mcheah@palantir.com>
Subject: Re: Utilizing YARN AM RPC port field

 

 

On 14 Jun 2016, at 01:30, Mingyu Kim <mkim@palantir.com> wrote:

 

Hi all,

 

YARN provides a way for an ApplicationMaster to register an RPC port so that a client outside
the YARN cluster can reach the application for RPCs, but Spark’s YARN AMs simply register
a dummy port number of 0. (See https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClient.scala#L74)
This is useful for long-running Spark application use cases where jobs are submitted via
some form of RPC to an already-started Spark context running in YARN cluster mode. Spark job
server (https://github.com/spark-jobserver/spark-jobserver) and Livy (https://github.com/cloudera/hue/tree/master/apps/spark/java)
are good open-source examples of these use cases. The current work-around is to have the Spark
AM call back to a configured URL with the port number of the RPC server the client should use
to communicate with the AM.

 

Utilizing the YARN AM RPC port field allows the port number to be reported in a secure way (i.e.
with the AM RPC port field and a Kerberized YARN cluster, you don’t need to re-invent a way to
verify the authenticity of the reported port number) and removes the callback from the YARN
cluster to the client, which means you can operate YARN in a low-trust environment and
run client applications behind a firewall.

 

A couple of proposals I have for utilizing the YARN AM RPC port are below. (Note that you cannot
simply pre-configure the port number and pass it to the Spark AM via configuration, because of
potential port conflicts on the YARN node.)

 

·         Start up an empty Jetty server during Spark AM initialization, set the port number
when registering the AM with the RM, and pass a reference to the Jetty server into the Spark
application (e.g. through SparkContext) so the application can dynamically add servlets/resources
to the Jetty server.

·         Have an optional static method in the main class (e.g. initializeRpcPort()) which
optionally sets up an RPC server and returns the RPC port. The Spark AM can call this method,
register the port number with the RM, and continue on to invoke the main method. I don’t see
this making a good API, though.
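The first proposal can be sketched roughly as follows. The JDK's built-in com.sun.net.httpserver.HttpServer stands in for Jetty in this sketch (the real implementation would use Jetty, per the proposal), and the "/jobs" path is made up for illustration:

```scala
import java.net.InetSocketAddress
import com.sun.net.httpserver.{HttpExchange, HttpServer}

// The AM starts an empty HTTP server on an ephemeral port during initialization.
// (JDK HttpServer used here as a stand-in for Jetty.)
val server = HttpServer.create(new InetSocketAddress(0), 0)

// This is the port that would be passed to registerApplicationMaster().
val registeredPort = server.getAddress.getPort
server.start()

// Later, the application (holding a reference handed to it, e.g. via SparkContext)
// dynamically adds a handler -- analogous to adding a servlet on Jetty.
server.createContext("/jobs", (exchange: HttpExchange) => {
  val body = "ok".getBytes("UTF-8")
  exchange.sendResponseHeaders(200, body.length)
  exchange.getResponseBody.write(body)
  exchange.close()
})
```

The key property is that the port is claimed and registered once at AM startup, while the routes served on it can evolve over the application's lifetime.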

 

I’m curious to hear what other people think. Would this be useful for anyone? What do you
think about the proposals? Please feel free to suggest other ideas. Thanks!

 

 

It's a recurrent irritation of mine that you can't ever change the HTTP/RPC ports of a YARN
AM after launch; it creates a complex startup state where you can't register until your IPC
endpoints are up.

 

Tactics

 

-Create a socket on an empty port, register it, and hand off the port to the RPC setup code as
the chosen port. Ideally, support a range to scan, so that systems which only open a specific
range of ports, e.g. 6500-6800, can have only those ports scanned. We've done this in other
projects.
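That range-scan tactic might look like this; bindInRange is a hypothetical helper, not an existing Spark or Hadoop API:

```scala
import java.net.ServerSocket
import scala.util.{Success, Try}

// Try each port in a configured window (e.g. 6500-6800) and keep the first
// that binds. The iterator keeps the scan lazy: we stop at the first success.
def bindInRange(lo: Int, hi: Int): Option[ServerSocket] =
  (lo to hi).iterator
    .map(p => Try(new ServerSocket(p)))
    .collectFirst { case Success(s) => s }
```

The bound ServerSocket (not just its port number) is what gets handed to the RPC setup code, so the port stays claimed between registration and server start.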

 

-Serve up the port-binding info via a REST API off the AM web UI; clients hit it (HEAD/GET only,
through the RM proxy), ask for the port, and work from there. Nonstandard, but it could be
extensible with other binding information (TTL of port caching, ....)

 

-Use the YARN-913 ZK-based registry to register/look up bindings. This is used in various YARN
apps to register service endpoints (RPC, REST); there's ongoing work for DNS support, which
would allow you to use DNS against a specific DNS server to get the endpoints. Works really
well with containerized deployments where the apps come up with per-container IP addresses
and fixed ports.

Although you couldn't get the latter into the spark-yarn code itself (it needs Hadoop 2.6+), you
can plug in support via the extension point implemented in SPARK-11314. I've actually thought
of doing that for a while... just been too busy.

 

-Just fix the bit of the YARN API that forces you to know your endpoints in advance. People
will appreciate it, though it will take a while to trickle downstream.

 

 

 

 

