spark-user mailing list archives

From rhettbutler <17647...@sun.ac.za>
Subject Pyspark not running the sqlContext in Pycharm
Date Fri, 02 Mar 2018 07:19:19 GMT
I hope someone can help with a problem I am having. I previously set up a VM
on Windows running CentOS, with Hadoop and Spark (all single-node), and it was
working perfectly.

I am now running a multi-node setup with another computer, both machines
running CentOS standalone. I have installed Hadoop successfully and it is
running on both machines. I then installed Spark with the following setup:

Version: Spark 2.2.1-bin-hadoop2.7, with the .bashrc file as follows:

export SPARK_HOME=/opt/spark/spark-2.2.1-bin-hadoop2.7

export PATH=$PATH:$SPARK_HOME/bin

export PATH="/home/hadoop/anaconda2/bin:$PATH"



I am using Anaconda (Python 2.7) to install the PySpark packages. I then have
the $SPARK_HOME/conf files set up as follows:

The slaves file:

datanode1

(the hostname of the node on which I run the processing)

and the spark-env.sh file:

export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk

export HADOOP_CONF_DIR=/opt/hadoop/hadoop-2.8.3/etc/hadoop

export SPARK_WORKER_CORES=6


The idea is that I then connect Spark to the PyCharm IDE and do my work there.
In PyCharm I have set up the environment variables (under Run -> Edit
Configurations) as:

PYTHONPATH /opt/spark/spark-2.2.1-bin-hadoop2.7/python/lib

SPARK_HOME /opt/spark/spark-2.2.1-bin-hadoop2.7

An image of the environment variables: 
<http://apache-spark-user-list.1001560.n3.nabble.com/file/t9029/UKaNp.png> 
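For completeness, a rough sketch of how the same wiring could be done from
inside the script instead of the run configuration (the glob for the py4j zip
is my assumption about the layout of the Spark distribution):

import glob
import os
import sys

SPARK_HOME = "/opt/spark/spark-2.2.1-bin-hadoop2.7"
os.environ["SPARK_HOME"] = SPARK_HOME

# PySpark itself lives under python/, and the bundled py4j ships as a zip
# under python/lib/ in the Spark distribution.
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
sys.path.extend(glob.glob(os.path.join(SPARK_HOME, "python", "lib", "py4j-*-src.zip")))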


I have also set my Python interpreter to point to the Anaconda Python
directory. With all of this set up, I get multiple errors as output when I
create either a Spark SQLContext or a SparkSession via its builder, for example:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setMaster("local[*]")
sc = SparkContext(conf=conf)
sql_sc = SQLContext(sc)


or

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("PythonTutPrac") \
    .config("spark.executor.memory", "2gb").getOrCreate()



The ERROR given:

File "/home/hadoop/Desktop/PythonPrac/CollaborativeFiltering.py", line 72,
in .config("spark.executor.memory", "2gb") \ File
"/opt/spark/spark-2.2.1-bin-hadoop2.7/python/pyspark/sql/session.py", line
183, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value) File
"/home/hadoop/anaconda2/lib/python2.7/site-packages/py4j/java_gateway.py",
line 1160, in call answer, self.gateway_client, self.target_id, self.name)
File "/opt/spark/spark-2.2.1-bin-hadoop2.7/python/pyspark/sql/utils.py",
line 79, in deco raise IllegalArgumentException(s.split(': ', 1)1,
stackTrace) pyspark.sql.utils.IllegalArgumentException: u"Error while
instantiating 'org.apache.spark.sql.internal.SessionStateBuilder':"
Unhandled exception in thread started by > Process finished with exit code 1

An image of the error:
<http://apache-spark-user-list.1001560.n3.nabble.com/file/t9029/A3D0u.png> 
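For reference, the stripped-down version of the session creation I can run to
isolate the failure (just a sketch; the app name is a placeholder and the
extra .config() call is dropped):

from pyspark.sql import SparkSession

# Minimal reproduction: build the session with no extra config so any failure
# points at the environment rather than a particular setting.
spark = SparkSession.builder.master("local[*]").appName("SmokeTest").getOrCreate()

print(spark.version)
print(spark.range(5).collect())  # tiny job to confirm the local executor runs

spark.stop()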


I do not know why this error message is showing; when I ran this on my
single-node VM it worked fine. I then removed datanode1 from my multi-node
setup and ran it again as a single-node setup on my main computer
(hostname: master), but I still get the same errors.

I hope someone can help, as I have followed other guides to set up PyCharm
with PySpark but could not figure out what is going wrong. Thanks!






--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

