spark-user mailing list archives

From Andrew Lee <alee...@hotmail.com>
Subject RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication
Date Tue, 06 May 2014 07:26:00 GMT
Hi Jacob,
I agree, we need to address both driver and workers bidirectionally.
If the subnet is isolated and self-contained, and only a limited set of ports is configured to access
the driver via a dedicated gateway for the user, could you explain your concern, or what doesn't
satisfy the security criteria?
Are you referring to a security certification or regulatory requirement that a separate subnet
with a configurable policy couldn't satisfy?
What I mean by a subnet is one that includes both the driver and the Workers. See the
following example setup.
e.g. (a /24 gives a max of 254 nodes):
Hadoop / HDFS => 10.5.5.0/24 (GW 10.5.5.1), on eth0
Spark Driver and Workers bind to => 10.10.10.0/24, on eth1, with routing to 10.5.5.0/24 on specific ports for the NameNode and DataNodes.
So the Driver and Workers are bound to the same subnet, which is separated from the others.
iptables for 10.10.10.0/24 can allow SSH login on port 22 (or port forwarding)
onto the Spark Driver machine to launch the shell or submit Spark jobs.
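
To make the idea concrete, a rough sketch of the iptables side could look like the following. This is illustrative only, not a hardened policy; the interface names, the subnets, and the HDFS ports 8020/50010 are just the example values above and the usual Hadoop 2.x defaults, so adjust them for a real cluster.

# Allow SSH onto the Spark Driver machine to launch the shell or submit jobs.
iptables -A INPUT -i eth1 -p tcp --dport 22 -j ACCEPT
# Driver and Workers share 10.10.10.0/24, so let them talk to each other freely.
iptables -A INPUT -i eth1 -s 10.10.10.0/24 -j ACCEPT
# Forward only the NameNode/DataNode ports toward the HDFS subnet (defaults 8020, 50010).
iptables -A FORWARD -s 10.10.10.0/24 -d 10.5.5.0/24 -p tcp -m multiport --dports 8020,50010 -j ACCEPT
# Drop everything else arriving on eth1.
iptables -A INPUT -i eth1 -j DROP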


Subject: RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking
communication
To: user@spark.apache.org
From: jeising@us.ibm.com
Date: Mon, 5 May 2014 12:40:53 -0500


Howdy Andrew,



I agree; the subnet idea is a good one...  unfortunately, it doesn't really help to secure
the network.



You mentioned that the drivers need to talk to the workers.  I think it is slightly broader
- all of the workers and the driver/shell need to be addressable from/to each other on any
dynamic port.



I would check out setting the environment variable SPARK_LOCAL_IP [1].  This seems to enable
Spark to bind correctly to a private subnet.
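
For example, a minimal sketch in conf/spark-env.sh on the driver host (10.10.10.5 is just a made-up address on a private subnet; use whichever local address you want Spark to bind to):

# conf/spark-env.sh
# Tell Spark which local address to bind its networking to,
# instead of the default resolved from the hostname.
export SPARK_LOCAL_IP=10.10.10.5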



Jacob



[1]  http://spark.apache.org/docs/latest/configuration.html 



Jacob D. Eisinger

IBM Emerging Technologies

jeising@us.ibm.com - (512) 286-6075



Andrew Lee ---05/04/2014 09:57:08 PM---Hi Jacob, Taking both concerns into account, I'm actually
thinking about using a separate subnet to



From:	Andrew Lee <alee526@hotmail.com>

To:	"user@spark.apache.org" <user@spark.apache.org>

Date:	05/04/2014 09:57 PM

Subject:	RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking
communication








Hi Jacob,



Taking both concerns into account, I'm actually thinking about using a separate subnet to
isolate the Spark Workers, but I need to look into how to bind the process onto the correct
interface first. This may require some code change.

A separate subnet doesn't restrict the port range, so port exhaustion should rarely happen,
and it won't impact performance.



Opening up all ports between 32768-61000 is effectively the same as having no firewall. This exposes
some security concerns, but I need more information on whether that is critical or not.



The bottom line is that the driver needs to talk to the Workers. How users access the driver
should be easier to solve, for example by launching the Spark (shell) driver on a specific interface.



Likewise, if you find any interesting solutions, please let me know. I'll share my solution
once I have something up and running. Currently it runs OK with iptables off, but I still
need to figure out how to productionize the security part.



Subject: RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking
communication

To: user@spark.apache.org

From: jeising@us.ibm.com

Date: Fri, 2 May 2014 16:07:50 -0500



Howdy Andrew,



I think I am running into the same issue [1] as you.  It appears that Spark opens dynamic
/ ephemeral [2] ports for each job on the shell and the workers.  As you are finding out, this
makes securing and managing the network for Spark very difficult.



> Any idea how to restrict the 'Workers' port range?

The port range can be found by running: 
$ sysctl net.ipv4.ip_local_port_range

net.ipv4.ip_local_port_range = 32768 61000


With that being said, a couple of avenues you may try:

1. Limit the dynamic ports [3] to a more reasonable number and open all of those ports on your
firewall; obviously, this might have unintended consequences like port exhaustion. (A rough sketch follows below.)
2. Secure the network another way, such as through a private VPN; this may reduce Spark's performance.
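
A minimal sketch of the first option, assuming iptables and an arbitrarily narrowed range of 32768-35000; the 10.10.10.0/24 subnet is just a placeholder for wherever your driver and workers live:

# Narrow the ephemeral port range on the driver and worker hosts
# (also add it to /etc/sysctl.conf so it survives a reboot).
$ sysctl -w net.ipv4.ip_local_port_range="32768 35000"

# Then open only that range between the driver and the workers.
$ iptables -A INPUT -p tcp -s 10.10.10.0/24 --dport 32768:35000 -j ACCEPT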


If you have other workarounds, I am all ears --- please let me know!

Jacob



[1] http://apache-spark-user-list.1001560.n3.nabble.com/Securing-Spark-s-Network-tp4832p4984.html

[2] http://en.wikipedia.org/wiki/Ephemeral_port

[3] http://www.cyberciti.biz/tips/linux-increase-outgoing-network-sockets-range.html



Jacob D. Eisinger

IBM Emerging Technologies

jeising@us.ibm.com - (512) 286-6075



Andrew Lee ---05/02/2014 03:15:42 PM---Hi Yana,  I did. I configured the the port in spark-env.sh,
the problem is not the driver port which



From: Andrew Lee <alee526@hotmail.com>

To: "user@spark.apache.org" <user@spark.apache.org>

Date: 05/02/2014 03:15 PM

Subject: RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking
communication







Hi Yana, 



I did. I configured the port in spark-env.sh; the problem is not the driver port, which
is fixed.

It's the Workers' ports that are dynamic every time they are launched in the YARN containers.
:-(



Any idea how to restrict the 'Workers' port range?



Date: Fri, 2 May 2014 14:49:23 -0400

Subject: Re: spark-shell driver interacting with Workers in YARN mode - firewall blocking
communication

From: yana.kadiyska@gmail.com

To: user@spark.apache.org



I think what you want to do is set spark.driver.port to a fixed port.
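
For example, a rough sketch (51000 is just an arbitrary illustrative port; depending on your Spark version this goes in conf/spark-defaults.conf or as a Java system property in conf/spark-env.sh):

# conf/spark-defaults.conf (Spark 1.0+)
spark.driver.port   51000

# or, on older releases, in conf/spark-env.sh
export SPARK_JAVA_OPTS="-Dspark.driver.port=51000"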





On Fri, May 2, 2014 at 1:52 PM, Andrew Lee <alee526@hotmail.com> wrote: 
Hi All,



I encountered this problem when the firewall is enabled between the spark-shell and the Workers.



When I launch spark-shell in yarn-client mode, I notice that the Workers in the YARN containers
are trying to talk to the driver (spark-shell); however, the firewall is not open, which causes
timeouts.



Each Worker seems to open a listening port in the 54xxx range. Is the port random
in that case?

What would be a better way to predict the ports so I can configure the firewall correctly
between the driver (spark-shell) and the Workers? Is there a range of ports we can specify
in the firewall/iptables?



Any ideas?
