spark-user mailing list archives

From Ashish Dutt <ashish.du...@gmail.com>
Subject Re: hadoop2.6.0 + spark1.4.1 + python2.7.10
Date Wed, 09 Sep 2015 08:59:46 GMT
Dear Sasha,

What I did was install the parcels on all the nodes of the
cluster. The typical location was
/opt/cloudera/parcels/CDH5.4.2-1.cdh5.4.2.p0.2
Hope this helps you.
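
In case it helps further, this is roughly how I point the driver at the
parcel's Spark from Python before creating the SparkContext. It is only a
sketch: the parcel version below is from my setup, and the interpreter path
is a hypothetical placeholder you would swap for your own Python 2.7.

import os
from pyspark import SparkConf, SparkContext

# Parcel layout from my cluster -- adjust the version string to whatever
# "ls /opt/cloudera/parcels" shows on your nodes.
parcel = "/opt/cloudera/parcels/CDH5.4.2-1.cdh5.4.2.p0.2"
os.environ.setdefault("SPARK_HOME", parcel + "/lib/spark")
# Hypothetical interpreter path; point it at the Python 2.7 on your nodes.
os.environ.setdefault("PYSPARK_PYTHON", "/usr/bin/python2.7")

conf = SparkConf().setMaster("yarn-client").setAppName("parcel-check")
sc = SparkContext(conf=conf)
print(sc.parallelize([1, 2, 3]).count())
sc.stop()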

With regards,
Ashish



On Tue, Sep 8, 2015 at 10:18 PM, Sasha Kacanski <skacanski@gmail.com> wrote:

> Hi Ashish,
> Thanks for the update.
> I tried all of it, but what I don't get is that I run a cluster with one
> node, so presumably I should have the PySpark binaries there already, as I
> am developing on the same host.
> Could you tell me where you placed the parcels, or whatever Cloudera uses?
> My understanding of YARN and Spark is that these binaries get compressed
> and packaged along with the Java pieces to be pushed to the worker node.
> Regards,
> On Sep 7, 2015 9:00 PM, "Ashish Dutt" <ashish.dutt8@gmail.com> wrote:
>
>> Hello Sasha,
>>
>> I have no answer for Debian. My cluster is on Linux and I'm using CDH 5.4.
>> Your question was: "Error from python worker:
>>   /cube/PY/Python27/bin/python: No module named pyspark"
>>
>> On a single node (i.e. one server/machine/computer) I installed the pyspark
>> binaries and it worked. I connected it to PyCharm and that worked too.
>>
>> Next I tried executing the pyspark command on another node (say a worker)
>> in the cluster, and I got this error message: "Error from python worker:
>> PATH: No module named pyspark".
>>
>> My first guess was that the worker was not picking up the path of the
>> pyspark binaries installed on the server. I tried many things: hard-coding
>> the pyspark path in the config.sh file on the worker -- NO LUCK; setting
>> the path dynamically from the code in PyCharm -- NO LUCK; searching the web
>> and asking the question in almost every online forum -- NO LUCK; banging my
>> head against pyspark/hadoop books -- NO LUCK. Finally, one fine day a
>> 'watermelon' dropped while brooding on this problem, and I installed the
>> pyspark binaries on all the worker machines. Now when I try executing just
>> the pyspark command on the workers it works, and simple program snippets
>> run on each worker too.
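>>
>> A quick way to double-check each worker (only a sketch, assuming a
>> SparkContext can be created at all) is to run a trivial job and print
>> which interpreter and pyspark install each executor actually picks up:
>>
>> from pyspark import SparkContext
>>
>> sc = SparkContext(appName="worker-probe")
>>
>> def probe(_):
>>     # Runs inside the executor's Python worker and reports which
>>     # interpreter and pyspark module that worker is actually using.
>>     import sys
>>     import pyspark
>>     yield (sys.executable, pyspark.__file__)
>>
>> print(sc.parallelize(range(8), 8).mapPartitions(probe).distinct().collect())
>> sc.stop()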
>>
>> I am not sure if this will help or not for your use-case.
>>
>>
>>
>> Sincerely,
>> Ashish
>>
>> On Mon, Sep 7, 2015 at 11:04 PM, Sasha Kacanski <skacanski@gmail.com>
>> wrote:
>>
>>> Thanks Ashish,
>>> Nice blog, but it does not cover my issue. Actually I have PyCharm running
>>> and loading pyspark and the rest of the libraries perfectly fine.
>>> My issue is that I am not sure what is triggering this:
>>>
>>> Error from python worker:
>>>   /cube/PY/Python27/bin/python: No module named pyspark
>>> PYTHONPATH was:
>>>
>>> /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/filecache/18/spark-assembly-1.4.1-hadoop2.6.0.jar
>>>
>>> The question is why YARN is not getting the Python package needed to run
>>> on the single node.
>>> Some people say to run with Java 6 because of zip library changes between
>>> Java 6/7/8, some point to a bug identified on Red Hat (I am on Debian), and
>>> some point to documentation errors, but nothing is really clear.
>>>
>>> I have binaries for Spark and Hadoop, and I did just fine with the Spark
>>> SQL module, Hive, Python, pandas and YARN.
>>> Locally, as I said, the app works fine (pandas to Spark DataFrame to Parquet).
>>> But as soon as I move to yarn-client mode, YARN is not getting the packages
>>> required to run the app.
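>>>
>>> For what it is worth, one workaround that keeps coming up in my searches
>>> (untested here, just a sketch) is to ship the pyspark zips to the executors
>>> explicitly instead of relying on what got packed into the assembly jar.
>>> This assumes SPARK_HOME points at the 1.4.1 install and that pyspark.zip
>>> and a py4j source zip actually sit under python/lib in that build:
>>>
>>> import glob
>>> import os
>>> from pyspark import SparkConf, SparkContext
>>>
>>> spark_home = os.environ["SPARK_HOME"]
>>> lib_dir = os.path.join(spark_home, "python", "lib")
>>> # pyspark.zip plus whatever py4j source zip ships with this build
>>> zips = glob.glob(os.path.join(lib_dir, "pyspark.zip"))
>>> zips += glob.glob(os.path.join(lib_dir, "py4j-*-src.zip"))
>>>
>>> conf = (SparkConf()
>>>         .setMaster("yarn-client")
>>>         .setAppName("PysparkPandas")
>>>         # let the executors' Python workers see the zips as well
>>>         .set("spark.executorEnv.PYTHONPATH", ":".join(zips)))
>>> sc = SparkContext(conf=conf)
>>> for z in zips:
>>>     sc.addPyFile(z)  # also ships the zips alongside the job
>>>
>>> Not sure that is the right long-term fix, but it at least removes the
>>> dependence on the assembly jar that the PYTHONPATH above points at.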
>>>
>>> If someone confirms that I need to build everything from source with a
>>> specific version of the software, I will do that, but at this point I am
>>> not sure what to do to remedy this situation...
>>>
>>> --sasha
>>>
>>>
>>> On Sun, Sep 6, 2015 at 8:27 PM, Ashish Dutt <ashish.dutt8@gmail.com>
>>> wrote:
>>>
>>>> Hi Aleksandar,
>>>> Quite some time ago, I faced the same problem and I found a solution
>>>> which I have posted here on my blog
>>>> <https://edumine.wordpress.com/category/apache-spark/>.
>>>> See if that can help you, and if it does not, you can check out these
>>>> questions & solutions on the stackoverflow
>>>> <http://stackoverflow.com/search?q=no+module+named+pyspark> website.
>>>>
>>>>
>>>> Sincerely,
>>>> Ashish Dutt
>>>>
>>>>
>>>> On Mon, Sep 7, 2015 at 7:17 AM, Sasha Kacanski <skacanski@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>> I am successfully running a Python app via PyCharm in local mode with
>>>>> setMaster("local[*]").
>>>>>
>>>>> When I turn on SparkConf().setMaster("yarn-client")
>>>>>
>>>>> and run via
>>>>>
>>>>> spark-submit PysparkPandas.py
>>>>>
>>>>>
>>>>> I run into this issue:
>>>>> Error from python worker:
>>>>>   /cube/PY/Python27/bin/python: No module named pyspark
>>>>> PYTHONPATH was:
>>>>>
>>>>> /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/filecache/18/spark-assembly-1.4.1-hadoop2.6.0.jar
>>>>>
>>>>> I am running java
>>>>> hadoop@pluto:~/pySpark$ /opt/java/jdk/bin/java -version
>>>>> java version "1.8.0_31"
>>>>> Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
>>>>> Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
>>>>>
>>>>> Should I try the same thing with Java 6/7?
>>>>>
>>>>> Is this a packaging issue, or do I have something wrong with the
>>>>> configurations?
>>>>>
>>>>> Regards,
>>>>>
>>>>> --
>>>>> Aleksandar Kacanski
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Aleksandar Kacanski
>>>
>>
>>
