I am getting started with Spark and Mesos. I already
have Spark running on a Mesos cluster and I am able to start the Scala
Spark and PySpark shells, yay! I still have questions about how to distribute
third-party Python libraries, since I want to use things like NLTK and MLlib on
PySpark, which require NumPy.
I am using Salt for configuration management, so
it is really easy for me to create an Anaconda virtual environment and
install all the libraries there on each Mesos slave.
My main question is: is that the recommended way of handling third-party libraries?
If the answer is yes, how do I tell PySpark to use that virtual environment (and not the default Python) on the Spark workers?
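For what it's worth, my current guess is that setting the PYSPARK_PYTHON environment variable on each slave (e.g. in conf/spark-env.sh, which Salt could template out) would point the workers at the Anaconda interpreter. The path below is hypothetical; it would be wherever Salt installs the environment:

```shell
# Point PySpark at the Anaconda env's interpreter instead of the system python.
# /opt/anaconda/envs/pyspark-env is a made-up example path; substitute the
# location your Salt states actually create on each Mesos slave.
export PYSPARK_PYTHON=/opt/anaconda/envs/pyspark-env/bin/python
```

Is that the right knob, or is there a Mesos-specific setting I should use instead?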
I noticed that there are addFile/addPyFile functions on the
SparkContext, but I don't want to distribute the libraries every single
time if I can just do that once by writing some Salt states. I am especially worried about NumPy and its requirements.
Hopefully this makes some sense.