spark-user mailing list archives

From Greg Hill <greg.h...@RACKSPACE.COM>
Subject Re: pyspark on yarn hdp hortonworks
Date Fri, 05 Sep 2014 19:22:07 GMT
I'm running into a problem getting this working as well.  I have spark-submit and spark-shell
working fine, but pyspark in interactive mode can't seem to find the lzo jar:

java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found

That class lives in /usr/lib/hadoop/lib/hadoop-lzo-0.6.0.jar, which is on my SPARK_CLASSPATH
environment variable, but that doesn't seem to be picked up by pyspark.

Any ideas?  I can't find much in the way of docs on getting the environment right for pyspark.
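[A sketch of one workaround, untested, assuming Spark 1.x and the jar path above: rather than relying on SPARK_CLASSPATH, which the interactive pyspark launcher may not propagate, put the jar on the driver and executor classpaths explicitly. `my_app.py` below is a hypothetical application file.]

```shell
# Sketch (untested): add the lzo jar to conf/spark-defaults.conf so that both
# the driver and the executors see it, regardless of how Spark is launched:
#
#   spark.driver.extraClassPath    /usr/lib/hadoop/lib/hadoop-lzo-0.6.0.jar
#   spark.executor.extraClassPath  /usr/lib/hadoop/lib/hadoop-lzo-0.6.0.jar
#
# Or, when submitting an application, pass the jar on the command line
# (my_app.py is a hypothetical placeholder):
spark-submit --master yarn-client \
  --driver-class-path /usr/lib/hadoop/lib/hadoop-lzo-0.6.0.jar \
  --jars /usr/lib/hadoop/lib/hadoop-lzo-0.6.0.jar \
  my_app.py
```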

Greg

From: Andrew Or <andrew@databricks.com>
Date: Wednesday, September 3, 2014 4:19 PM
To: Oleg Ruchovets <oruchovets@gmail.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: pyspark on yarn hdp hortonworks

Hi Oleg,

There isn't much you need to do to setup a Yarn cluster to run PySpark. You need to make sure
all machines have python installed, and... that's about it. Your assembly jar will be shipped
to all containers along with all the pyspark and py4j files needed. One caveat, however, is
that the jar needs to be built with Maven, and not on a Red Hat-based OS:

http://spark.apache.org/docs/latest/building-with-maven.html#building-for-pyspark-on-yarn

In addition, it should be built with Java 6 because of a known issue with building jars with
Java 7 and including python files in them (https://issues.apache.org/jira/browse/SPARK-1718).
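[A sketch of the build described above. The `-Pyarn` profile and `-Dhadoop.version` flag come from the linked build docs; the Hadoop version and JDK path are assumptions — match them to your cluster and Java 6 install.]

```shell
# Sketch (untested): build the assembly with Maven on Java 6, per SPARK-1718.
export JAVA_HOME=/path/to/jdk6          # placeholder path to a Java 6 JDK

# hadoop.version is an assumption -- set it to your cluster's Hadoop version:
mvn -Pyarn -Dhadoop.version=2.4.0 -DskipTests clean package
```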
Lastly, if you have trouble getting it to work, you can follow the steps I have listed in
a different thread to figure out what's wrong:

http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3cCAMJOb8mr1+ias-SLDz_RfRKe_nA2UUbNmHraC4NUKqYqNUNHuQ@mail.gmail.com%3e

Let me know if you can get it working,
-Andrew





2014-09-03 5:03 GMT-07:00 Oleg Ruchovets <oruchovets@gmail.com>:
Hi all.
   I have been trying to run pyspark on yarn for a couple of days now:

http://hortonworks.com/kb/spark-1-0-1-technical-preview-hdp-2-1-3/

I posted the exception in previous posts. It looks like I didn't do the configuration correctly.
  I have googled quite a lot, but I can't find the steps needed to configure PySpark to run
on Yarn.

Can you please share the steps (critical points) needed to configure PySpark on Yarn
(Hortonworks distribution):
  Environment variables
  Classpath
  Copying jars to all machines
  Other configuration
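[For archive readers, a sketch of the minimal setup the points above ask about, untested and assuming an HDP-style layout — adjust every path to your installation.]

```shell
# Sketch (untested): minimal environment for PySpark on Yarn.
export HADOOP_CONF_DIR=/etc/hadoop/conf   # lets Spark locate the Yarn ResourceManager
export SPARK_HOME=/usr/lib/spark          # assumed install location

# Interactive PySpark against Yarn (Spark 1.x master name); on some older
# releases the master was selected via `MASTER=yarn-client` in the
# environment rather than a flag:
$SPARK_HOME/bin/pyspark --master yarn-client
```

No manual copying of jars to each machine should be needed: as Andrew notes above, the assembly jar and the pyspark/py4j files are shipped to the containers automatically.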

Thanks
Oleg.


