I have been trying to use Spark to process some logs, and I have run into several difficulties along the way. I was able to overcome most of them, but I am really stuck on the last one.
I would really like to know how Spark is supposed to be deployed. For now, I have an SSH key on the master that can log in to any worker, and start-master.sh and start-slaves.sh both work.
According to the docs, I crafted the following command:
~/projects/bigdata/spark/spark/bin/spark-submit --py-files /home/javier/projects/bigdata/bdml/dist/bdml-0.0.1.zip --master=spark://10.0.0.71:7077 ml/spark_pipeline.py /srv/bdml/raw2json/json-logs.gz
First, when I tried to deploy my project, it felt like an impossible quest: I kept getting module import errors:
Traceback (most recent call last):
File "/home/javier/projects/bigdata/bdml/ml/spark_pipeline.py", line 10, in <module>
from .files import get_interesting_files
I tried everything, and at one point I even had to dig into the Scala code to trace the error. In the end I just merged all the project's functions into one file.
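For what it's worth, I now believe the import error was not about --py-files at all: spark-submit runs the main script as the top-level "__main__" module, not as part of a package, so a relative import like `from .files import ...` can never resolve. A small self-contained sketch of the rule (the module names are just illustrations):

```python
# A relative import resolves against the importing module's parent package.
# A script run directly executes as "__main__", which has no parent package,
# so `from .files import get_interesting_files` raises an ImportError there.
def can_use_relative_import(module_name: str) -> bool:
    # rpartition(".") splits off the parent package; an empty parent
    # means there is nothing for a relative import to resolve against.
    return bool(module_name.rpartition(".")[0])

print(can_use_relative_import("__main__"))          # the submitted script -> False
print(can_use_relative_import("bdml.ml.pipeline"))  # a module inside a package -> True
```

So the fix that presumably would have avoided merging everything into one file is to keep the code inside the zipped package and use absolute imports (`from bdml.files import get_interesting_files`, assuming the zip contains a `bdml/` package with an `__init__.py`) in the submitted script.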
Then I started to get the following error:
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 10.0.0.73): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.4, PySpark cannot run with different minor versions
I have specified #!/usr/bin/env python3 at the top of the file, and my spark-env.sh on each worker contains the following lines.
I had to specify the PYTHONHASHSEED because it wasn't propagating to the workers.
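In case it helps to compare, this is roughly what I understand the workers need in conf/spark-env.sh so that driver and executors use the same interpreter (the /usr/bin/python3 path is an example; it should point at the same minor version everywhere):

```shell
# conf/spark-env.sh on the master and on every worker (example paths):
export PYSPARK_PYTHON=/usr/bin/python3         # interpreter the executors launch
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3  # interpreter for the driver
export PYTHONHASHSEED=0                        # fixed hash seed, since it wasn't propagating
```

The shebang in the script only affects how the driver starts; the executors pick their interpreter from PYSPARK_PYTHON, which is why the 2.7-vs-3.4 mismatch appears even with #!/usr/bin/env python3 at the top.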
I hope you can help me,
|Javier Domingo Cansino|
|Research & Development Engineer|