spark-user mailing list archives

From Michael Segel <msegel_had...@hotmail.com>
Subject Re: spark with kerberos
Date Tue, 18 Oct 2016 23:18:30 GMT
(Sorry sent reply via wrong account.. )

Steve,

Kinda hijacking the thread, but I promise it’s still on topic to the OP’s issue… ;-)

Usually you will end up having a local Kerberos setup per cluster.
So your machine accounts (hive, yarn, hbase, etc.) are going to be local to the cluster.

So you will have to set up some sort of realm trusts between the clusters.

If you’re going to be setting up security (Kerberos … ick! shivers… ;-), you’re going
to want to keep the machine accounts isolated to the cluster.
And the OP said that he didn’t control the other cluster, which makes me believe that they
are separate.


I would also think that you would have trouble with the credential… isn’t it tied to a
user at a specific machine?
(It’s been a while since I looked at this and I drank heavily to forget Kerberos… so I may
be a bit fuzzy here.)

Thx

-Mike
On Oct 18, 2016, at 2:59 PM, Steve Loughran <stevel@hortonworks.com> wrote:


On 17 Oct 2016, at 22:11, Michael Segel <michael_segel@hotmail.com> wrote:

@Steve you are going to have to explain what you mean by ‘turn Kerberos on’.

Taken one way… it could mean making cluster B secure and running Kerberos, and then you’d
have to create some sort of trust between B and C.



I'd imagined making cluster B a kerberized cluster.

I don't think you need to go near trust relations, though. Ideally you'd just want the same
accounts everywhere if you can; if not, the main thing is that the user submitting the job
can get a credential for that far NN at job submission time, and that credential is propagated
all the way to the executors.
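
To make that concrete: a minimal, untested Scala sketch of that submission-time check, using
Hadoop's UserGroupInformation (the error message is just illustrative):

import org.apache.hadoop.security.UserGroupInformation

// The user driving spark-submit must already hold a Kerberos TGT that
// the far cluster will accept; verify that before going further.
val ugi = UserGroupInformation.getCurrentUser
require(ugi.hasKerberosCredentials,
  s"${ugi.getUserName} has no Kerberos TGT; run kinit before submitting")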


Did you mean to turn on Kerberos on the nodes in cluster B so that each node becomes a trusted
client that can connect to C?

OR

Did you mean to turn on Kerberos on the master node (e.g. an edge node), where the data persists
if you collect() it so it's off the cluster and on a single machine, and then push it from
there, so that only that machine has to have Kerberos running as a trusted client of cluster
C?


Note: in option 3, I hope I said it correctly, but I believe that you would be collecting
the data to a client (edge node) before pushing it out to the secured cluster, roughly as in
the sketch below.
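
Roughly, that edge-node option would look something like this untested sketch (principal,
keytab path, and host names are all placeholders):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("edge-node-push"))

// Compute on the non-kerberized cluster B, then pull everything to the driver.
val results: Array[String] = sc.textFile("hdfs://cluster-a-nn:8020/input")
  .map(_.toUpperCase)   // stand-in for the real job logic
  .collect()

// Kerberos login happens on the driver/edge node only.
val hadoopConf = new Configuration()
hadoopConf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(hadoopConf)
UserGroupInformation.loginUserFromKeytab("etl@C.REALM", "/etc/security/keytabs/etl.keytab")

// Push the collected results to the secure cluster C with the plain HDFS client.
val fs = FileSystem.get(new URI("hdfs://cluster-c-nn:8020"), hadoopConf)
val out = fs.create(new Path("/results/part-00000"))
try results.foreach(line => out.write((line + "\n").getBytes("UTF-8")))
finally out.close()

The obvious downside is the collect(): the whole result set has to fit on that one machine.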





Does that make sense?

On Oct 14, 2016, at 1:32 PM, Steve Loughran <stevel@hortonworks.com<mailto:stevel@hortonworks.com>>
wrote:


On 13 Oct 2016, at 10:50, dbolshak <bolshakov.denis@gmail.com> wrote:

Hello community,

We have a challenge and no idea how to solve it.

The problem:

Say we have the following environment:
1. `cluster A`, the cluster does not use Kerberos and we use it as a source
of data; the important thing is that we don't manage this cluster.
2. `cluster B`, a small cluster where our Spark application is running and
performing some logic (we manage this cluster and it does not have
Kerberos).
3. `cluster C`, the cluster uses Kerberos and we use it to keep the results of
our Spark application; we manage this cluster.

Our requirements and conditions that are not mentioned yet:
1. All clusters are in a single data center, but in different
subnetworks.
2. We cannot turn on Kerberos on `cluster A`.
3. We cannot turn off Kerberos on `cluster C`.
4. We can turn Kerberos on/off on `cluster B`; currently it's turned off.
5. The Spark app is built on top of the RDD API and does not depend on spark-sql.

Does anybody know how to write data using the RDD API to a remote cluster which is
running with Kerberos?

If you want to talk to the secure cluster, C, from code running in cluster B, you'll need
to turn Kerberos on there. Maybe, maybe, you could get away with Kerberos being turned
off, with you, the user, launching the application while logged in to Kerberos yourself and
so trusted by cluster C.

One of the problems you are likely to hit with Spark here is that it's only going to collect
the tokens you need to talk to HDFS at the time you launch the application, and by default
it only knows about the cluster FS. You will need to tell Spark about the other filesystem
at launch time, so it will know to authenticate with it as you and collect the tokens needed
for the application itself to work with Kerberos:

spark.yarn.access.namenodes=hdfs://cluster-c:8080
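
Putting it together, the RDD-only flow would look roughly like this untested sketch (host
names are placeholders; also check the actual RPC port of the far NN, which defaults to 8020
on most distributions):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cross-cluster-copy")
  // Ask Spark to fetch an HDFS delegation token for the remote kerberized
  // NameNode as well as the local one, at submission time.
  .set("spark.yarn.access.namenodes", "hdfs://cluster-c-nn:8020")

val sc = new SparkContext(conf)

// Read from the insecure source cluster A, write to the secure cluster C;
// the executors use the delegation token fetched at launch.
sc.textFile("hdfs://cluster-a-nn:8020/source/path")
  .saveAsTextFile("hdfs://cluster-c-nn:8020/dest/path")

sc.stop()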

-Steve

ps: https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/



