I have been trying to use Spark to process some logs, and I have run into several difficulties along the way. I could overcome most of them, but I am really stuck on the last one.

I would really like to know how Spark is supposed to be deployed. For now, I have an SSH key on the master that can log in to any worker, and start-master.sh and start-slaves.sh both work.

According to the docs, I crafted the following command:
 ~/projects/bigdata/spark/spark/bin/spark-submit --py-files /home/javier/projects/bigdata/bdml/dist/bdml-0.0.1.zip --master='spark://' ml/spark_pipeline.py /srv/bdml/raw2json/json-logs.gz

First, deploying my project turned out to be nearly impossible. I kept getting module import errors:
Traceback (most recent call last):
  File "/home/javier/projects/bigdata/bdml/ml/spark_pipeline.py", line 10, in <module>
    from .files import get_interesting_files
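
A minimal sketch of the import side of this, assuming the failure comes from the relative import (`from .files import ...`): a script launched directly runs as the top-level `__main__` module, so there is no parent package to resolve the leading dot against. As far as I understand, `--py-files` puts the zip on the Python path, so a package inside it should be importable by its absolute name. The package name `demo_pkg` and the zip contents below are hypothetical stand-ins, not the real bdml layout.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_demo_zip(directory):
    """Create a zip holding a hypothetical package `demo_pkg` with a `files` module."""
    zip_path = os.path.join(directory, "demo_pkg.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("demo_pkg/__init__.py", "")
        archive.writestr(
            "demo_pkg/files.py",
            "def get_interesting_files():\n    return ['a.log', 'b.log']\n",
        )
    return zip_path


def import_from_zip(zip_path):
    """Put the zip on sys.path and import the module by its absolute name."""
    if zip_path not in sys.path:
        sys.path.insert(0, zip_path)
    importlib.invalidate_caches()
    return importlib.import_module("demo_pkg.files")


# Hypothetical stand-in for the archive that would be passed via --py-files.
tmp_dir = tempfile.mkdtemp()
files_module = import_from_zip(build_demo_zip(tmp_dir))
print(files_module.get_interesting_files())
```

In the real script this would correspond to writing the import in absolute form (package name first) instead of `from .files import ...`, provided the zip has the package directory at its root.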

I tried everything I could think of, and at one point I even had to dig into the Scala code to trace that error. In the end I just merged all the functions of the project into one file.

Then I started to get the following error:
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.4, PySpark cannot run with different minor versions
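
A rough sketch of the check that seems to produce this message, reconstructed only from the traceback line above (`"%d.%d" % sys.version_info[:2]` compared against a version string sent by the driver). The function names are hypothetical and this is not the actual worker.py source. It shows that only the major.minor pair has to match.

```python
import sys


def minor_version(version_info=sys.version_info):
    """Format a version tuple as 'major.minor', for example '3.4'."""
    return "%d.%d" % tuple(version_info[:2])


def check_worker_version(worker_version_info, driver_version):
    """Raise if the worker's major.minor differs from the driver's version string."""
    worker_version = minor_version(worker_version_info)
    if worker_version != driver_version:
        raise Exception(
            "Python in worker has different version %s than that in driver %s, "
            "PySpark cannot run with different minor versions"
            % (worker_version, driver_version)
        )
    return worker_version


# A 3.4.x worker passes against a 3.4 driver; a 2.7 worker would raise.
print(check_worker_version((3, 4, 3), "3.4"))
```

So the error means the interpreter actually launched on the worker was a 2.7 one, regardless of what the shebang line of the driver script says.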

I have specified #!/usr/bin/env python3 at the top of the file, and spark-env.sh on each worker contains the following:
export PYSPARK_PYTHON=python3.4

I had to set PYTHONHASHSEED explicitly because it wasn't propagating to the workers.
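
One thing that may be worth trying is setting these variables in the environment of the process that runs spark-submit as well, since my understanding (an assumption, not something confirmed here) is that the driver reads PYSPARK_PYTHON from its own environment and tells the workers which executable to start. A minimal sketch of building such an environment; the helper name and its parameters are hypothetical, and the hash seed value is left to the caller.

```python
import os


def spark_submit_env(python_exec="python3.4", hash_seed=None, base_env=None):
    """Return a copy of the environment with the PySpark variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # Interpreter name that PySpark should use; must exist on every worker.
    env["PYSPARK_PYTHON"] = python_exec
    if hash_seed is not None:
        # Environment variables are strings, so convert the seed.
        env["PYTHONHASHSEED"] = str(hash_seed)
    return env


env = spark_submit_env("python3.4")
print(env["PYSPARK_PYTHON"])
```

The resulting dict could be passed as `env=` to `subprocess.call` when launching spark-submit from a script, or the same two variables could simply be exported in the shell before running the command.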

