spark-dev mailing list archives

From Tony Kinsley <tkinsle...@gmail.com>
Subject Accessing Secure Hadoop from Mesos cluster
Date Wed, 13 Apr 2016 04:57:33 GMT
I have been working towards getting some Spark Streaming jobs to run in
Mesos cluster mode (using Docker containers) and write data periodically to
a secure HDFS cluster. Unfortunately, this does not seem to be well
supported currently in Spark (
https://issues.apache.org/jira/browse/SPARK-12909). The problem seems to be
that A) a principal and keytab passed in are only processed if the backend
is YARN, and B) all the code for renewing tickets is implemented by the YARN
backend.


My first attempt to get around this problem was to create Docker containers
that would use a custom entrypoint to run a process manager, with cron
running in each container to periodically run kinit. I was hoping this
would work since Spark can correctly log in if the TGT exists (at least in
my tests, manually running kinit and then running Spark in local mode).
However, this hack will not work (currently, anyway) because the Mesos
scheduler does not specify whether a shell should be used for the command.
Mesos defaults to using the shell and then overrides the entrypoint of the
Docker image with /bin/sh (https://issues.apache.org/jira/browse/MESOS-1770).
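
For reference, the periodic kinit idea amounts to roughly the loop below.
This is only a minimal sketch, not what actually ran in the containers: the
function names, the injectable run/sleep parameters, and the principal and
keytab path in the usage comment are all hypothetical. The only part taken
from the real tooling is the standard `kinit -k -t <keytab> <principal>`
invocation.

```python
import subprocess
import time


def build_kinit_command(principal, keytab_path):
    # -k asks kinit to obtain the ticket from a keytab; -t names the keytab file.
    return ["kinit", "-k", "-t", keytab_path, principal]


def refresh_tgt_periodically(principal, keytab_path, interval_seconds,
                             run=subprocess.run, sleep=time.sleep,
                             max_rounds=None):
    """Re-run kinit every interval_seconds; return the number of failed runs.

    run and sleep are injectable (hypothetical parameters) so the loop can be
    exercised without a KDC. max_rounds=None loops forever, like a cron job.
    """
    command = build_kinit_command(principal, keytab_path)
    rounds = 0
    failures = 0
    while max_rounds is None or rounds < max_rounds:
        result = run(command, check=False)
        if result.returncode != 0:
            failures += 1  # keep going; the next round may succeed
        rounds += 1
        sleep(interval_seconds)
    return failures


# Hypothetical usage inside a container (values are placeholders):
# refresh_tgt_periodically("spark@EXAMPLE.COM", "/etc/spark.keytab", 3600)
```

The interval would need to be shorter than the ticket lifetime configured on
the KDC, and the loop only helps if it is actually started, which is exactly
what the entrypoint override prevents.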


Since I have not been able to come up with an acceptable workaround, I am
looking into the possibility of adding the functionality to Spark, but I
wanted to check in to make sure I was not duplicating others' work and also
to get some general advice on a good approach to solving this problem. I
have found this old email chain that talks about some different challenges
associated with authenticating correctly to the NameNodes (
http://comments.gmane.org/gmane.comp.lang.scala.spark.user/14257).


I've noticed that the YARN security settings are namespaced to be specific
to YARN and that there is some code that seems to be fairly generic
(AMDelegationTokenRenewer.scala and ExecutorDelegationTokenUpdater, for
instance, although I'm not sure about the use of YarnSparkHadoopUtil).
It would seem to me that some of this code could be reused across the
various cluster backends. That said, I am fairly new to working with Hadoop
and Spark, and do not claim to understand the inner workings of YARN or
Mesos, although I feel much more comfortable with Mesos.
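
As an illustration of what I mean by "generic": deciding when to renew does
not obviously depend on the cluster backend. The sketch below is only my
guess at the shape of such logic, not a transcription of the Spark code. The
function name and the 0.75 default fraction are assumptions made for the
example.

```python
def seconds_until_renewal(issued_at, expires_at, now, fraction=0.75):
    """Return how long to wait before renewing a ticket or token.

    Renewal is scheduled once `fraction` of the total lifetime has elapsed.
    All times are seconds on the same clock. The 0.75 default is an
    illustrative assumption, not a value taken from any backend.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    if expires_at <= issued_at:
        raise ValueError("expires_at must be later than issued_at")
    renew_at = issued_at + fraction * (expires_at - issued_at)
    # If the renewal point has already passed, renew immediately.
    return max(0.0, renew_at - now)
```

Nothing in this calculation refers to YARN or Mesos; only the part that
distributes the refreshed credentials to executors would need to know about
the backend.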


I would definitely appreciate some guidance, especially since we would be
interested in contributing back whatever I or ViaSat (my employer) get
working, and we would very much like to avoid maintaining a fork of Spark.

Tony
