spark-user mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: Data Security on Spark-on-HDFS
Date Mon, 31 Aug 2015 12:09:43 GMT

> On 31 Aug 2015, at 11:02, Daniel Schulz <danielschulz2005@hotmail.com> wrote:
> 
> Hi guys,
> 
> In a nutshell: does Spark check and respect user privileges when reading/writing data?

Yes, in a locked-down YARN cluster, until your tokens expire.

> 
> I am curious about the data security when Spark runs on top of HDFS, maybe through YARN.
> Is Spark running its long-running JVM processes as a Spark user that makes no distinction
> when accessing data? So is there a shortcoming when using Spark because the JVM processes
> are already running and therefore the launching user is ignored by Spark when accessing data
> residing on HDFS? Or is Spark only reading/writing data that the user who launched this
> thread had access to?


In a kerberized YARN cluster, the processes run as the specific user submitting the job (or
whoever the Kerberos ID -> OS ID mapping files say they are), with the delegation tokens
passed up from the client to talk to HDFS. In Spark 1.5 you get the Hive credentials pushed
up too.
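
Roughly, from inside a running driver or executor you can see who you actually are and what
tokens were shipped up; a minimal sketch (plain Hadoop UGI API, nothing Spark-specific assumed):

  import org.apache.hadoop.security.UserGroupInformation
  import scala.collection.JavaConverters._

  // On a kerberized cluster this is the submitting user, not a shared
  // "spark" service account.
  val ugi = UserGroupInformation.getCurrentUser
  println(s"running as: ${ugi.getShortUserName}")

  // The delegation tokens the client passed up with the application.
  ugi.getCredentials.getAllTokens.asScala.foreach { t =>
    println(s"token: kind=${t.getKind} service=${t.getService}")
  }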

This means that access is granted with the rights of the user deploying the application, with
HDFS checking the permissions on every request.
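
Concretely (paths made up, `sc` the usual SparkContext), per-request checking means:

  // HDFS enforces its permissions on every namenode call, so the executors can
  // only read what the submitting user can read.
  val mine = sc.textFile("hdfs:///user/alice/input").count()     // fine if alice submitted the job
  val notMine = sc.textFile("hdfs:///user/bob/private").count()  // fails with AccessControlException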

It also means that when the HDFS delegation tokens expire, your HDFS access goes away. Spark
1.5 addresses this by allowing you to optionally provide a keytab for the app master, which
is used to re-authenticate with the KDC and then with HDFS. This changes the problem to "getting
your cluster ops team to give you a keytab".
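
If I recall right, the spark-submit options for that are --principal and --keytab. Under the
hood it is the standard Hadoop keytab login; a rough sketch of the mechanism (principal and
keytab path made up, and this is the generic UGI pattern rather than Spark's actual code):

  import org.apache.hadoop.security.UserGroupInformation

  // Log in from the keytab so the process has a Kerberos identity of its own...
  UserGroupInformation.loginUserFromKeytab(
    "alice@EXAMPLE.COM",                       // principal (made up)
    "/etc/security/keytabs/alice.keytab")      // keytab path (made up)

  // ...and re-login periodically so fresh delegation tokens can be fetched
  // once the old ones can no longer be renewed.
  UserGroupInformation.getLoginUser.checkTGTAndReloginFromKeytab()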

The new O'Reilly book, Hadoop Security, is the best starting point for Hadoop cluster security; spending
some money on the eBook is a worthwhile investment.


I'm doing a low-level document on the internals at https://github.com/steveloughran/kerberos_and_hadoop/,
though that's targeted at developers and people debugging their code more than at users of
applications.



> 
> What about the local store when running in Standalone mode? What about access calls to HBase
> or Hive then?
> 

Someone else will have to cover that.
