spark-user mailing list archives

From Nicholas Chammas <nicholas.cham...@gmail.com>
Subject Re: Python 2.7 + numpy break sortByKey()
Date Wed, 05 Mar 2014 17:02:43 GMT
Devs? Is this an issue for you that deserves a ticket?


On Sun, Mar 2, 2014 at 4:32 PM, Nicholas Chammas <nicholas.chammas@gmail.com> wrote:

> So this issue appears to be related to the other Python 2.7-related issue
> I reported in this thread<http://apache-spark-user-list.1001560.n3.nabble.com/java-net-SocketException-on-reduceByKey-in-pyspark-td2184.html>.
>
> Shall I open a bug in JIRA about this and include the wikistat repro?
>
> Nick
>
>
> On Sun, Mar 2, 2014 at 1:50 AM, nicholas.chammas <nicholas.chammas@gmail.com> wrote:
>
>> Unexpected behavior. Here's the repro:
>>
>>    1. Launch an EC2 cluster with spark-ec2. 1 slave; default instance
>>    type.
>>    2. Upgrade the cluster to Python 2.7 using the instructions here<https://spark-project.atlassian.net/browse/SPARK-922?focusedCommentId=15711&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15711>.
>>    3. pip2.7 install numpy
>>    4. Run this script in the pyspark shell:
>>
>>    wikistat = sc.textFile('s3n://ACCESSKEY:SECRET@bigdatademo/sample/wiki/pagecounts-20100212-050000.gz')
>>    wikistat = wikistat.map(lambda x: x.split(' ')).cache()
>>    wikistat.map(lambda x: (x[1], int(x[3]))).map(lambda x: (x[1], x[0])).sortByKey(False).take(5)
>>
>>    5. You will see a long error output that includes a complaint about
>>    NumPy not being installed.
>>    6. Now remove the sortByKey() from that last line and rerun it.
>>
>>    wikistat.map(lambda x: (x[1], int(x[3]))).map(lambda x: (x[1], x[0])).take(5)
>>
>>    You should see your results without issue. So it's the sortByKey()
>>    that's choking.
>>    7. Quit the pyspark shell and pip uninstall numpy.
>>    8. Rerun the three lines from step 4. Enjoy your sorted results
>>    error-free.
>>
>> Can anyone else reproduce this issue? Is it a bug? I don't see it if I
>> leave the cluster on the default Python 2.6.8.
>>
>> Installing numpy on the slave via pssh and pip2.7 (so that it's identical
>> to the master) does not fix the issue. Dunno if installing Python packages
>> everywhere is even necessary though.
>>
>> Nick
>>
>>
>> ------------------------------
>> View this message in context: Python 2.7 + numpy break sortByKey()<http://apache-spark-user-list.1001560.n3.nabble.com/Python-2-7-numpy-break-sortByKey-tp2214.html>
>> Sent from the Apache Spark User List mailing list archive<http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>
>
>
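
For anyone reading the repro without a cluster handy, here is a rough local sketch of what the step 4 pipeline computes, in plain Python with no Spark involved. The sample lines and the helper name top_by_fourth_field are made up for illustration and are not taken from the real pagecounts file; the sketch only assumes space-separated lines with at least four fields, as the lambdas in the repro do. It does not reproduce the NumPy error itself, only the intended result of the map, map, sortByKey(False), take(5) chain.

```python
# Hypothetical sample lines in a space-separated, four-field layout
# (made up for illustration; not real data from the pagecounts file).
sample_lines = [
    "en Alpha 3 100",
    "en Beta 7 900",
    "en Gamma 1 400",
]


def top_by_fourth_field(lines, n=5):
    """Local stand-in for map -> map -> sortByKey(False) -> take(n)."""
    # wikistat.map(lambda x: x.split(' '))
    rows = [line.split(' ') for line in lines]
    # .map(lambda x: (x[1], int(x[3])))
    pairs = [(row[1], int(row[3])) for row in rows]
    # .map(lambda x: (x[1], x[0])) swaps each pair so the number is the key
    swapped = [(value, name) for name, value in pairs]
    # .sortByKey(False) sorts by key in descending order
    swapped.sort(key=lambda kv: kv[0], reverse=True)
    # .take(n) returns the first n elements
    return swapped[:n]


result = top_by_fourth_field(sample_lines)
print(result)
```

If the cluster run succeeds, take(5) should likewise return a list of (number, name) tuples ordered from the largest key down.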
