spark-user mailing list archives

From Nicholas Chammas <nicholas.cham...@gmail.com>
Subject Re: Python 2.7 + numpy break sortByKey()
Date Sun, 02 Mar 2014 21:32:34 GMT
So this issue appears to be related to the other Python 2.7 issue I
reported in this thread:
<http://apache-spark-user-list.1001560.n3.nabble.com/java-net-SocketException-on-reduceByKey-in-pyspark-td2184.html>

Shall I open a bug in JIRA about this and include the wikistat repro?

Nick


On Sun, Mar 2, 2014 at 1:50 AM, nicholas.chammas <nicholas.chammas@gmail.com> wrote:

> Unexpected behavior. Here's the repro:
>
>    1. Launch an EC2 cluster with spark-ec2. 1 slave; default instance
>    type.
>    2. Upgrade the cluster to Python 2.7 using the instructions here:
>    <https://spark-project.atlassian.net/browse/SPARK-922?focusedCommentId=15711&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15711>
>    3. pip2.7 install numpy
>    4. Run this script in the pyspark shell:
>
>    wikistat = sc.textFile('s3n://ACCESSKEY:SECRET@bigdatademo/sample/wiki/pagecounts-20100212-050000.gz')
>    wikistat = wikistat.map(lambda x: x.split(' ')).cache()
>    wikistat.map(lambda x: (x[1], int(x[3]))).map(lambda x: (x[1], x[0])).sortByKey(False).take(5)
>
>    5. You will see a long error output that includes a complaint about
>    NumPy not being installed.
>    6. Now remove the sortByKey() from that last line and rerun it.
>
>    wikistat.map(lambda x: (x[1], int(x[3]))).map(lambda x: (x[1], x[0])).take(5)
>
>    You should see your results without issue. So it's the sortByKey()
>    that's choking.
>    7. Quit the pyspark shell and pip uninstall numpy.
>    8. Rerun the three lines from step 4. Enjoy your sorted results
>    error-free.
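
For anyone who wants to check what the step 4 pipeline should return without a cluster, here is a rough plain-Python sketch of the same transformations. It is not Spark code. The sample lines are made up, and it assumes each pagecounts line has four space-separated fields with the page title at index 1 and a number at index 3. `top_pages` is a hypothetical helper name.

```python
# Plain-Python sketch of the step 4 pipeline (no Spark involved).
# The sample lines below are invented for illustration.
sample_lines = [
    "en Main_Page 1200 50000",
    "en Apache_Spark 35 9000",
    "de Hauptseite 400 20000",
    "en Python 80 12000",
    "fr Accueil 300 15000",
    "en NumPy 12 3000",
]


def top_pages(lines, n=5):
    # Mirrors: wikistat.map(lambda x: x.split(' '))
    rows = [line.split(' ') for line in lines]
    # Mirrors: .map(lambda x: (x[1], int(x[3])))
    pairs = [(row[1], int(row[3])) for row in rows]
    # Mirrors: .map(lambda x: (x[1], x[0])), so the number becomes the key
    swapped = [(value, title) for (title, value) in pairs]
    # Mirrors: .sortByKey(False), i.e. descending order by key
    swapped.sort(key=lambda kv: kv[0], reverse=True)
    # Mirrors: .take(5)
    return swapped[:n]


print(top_pages(sample_lines))
```

If the cluster run without numpy gives results shaped like this (number first, title second, largest number first), the pipeline itself is behaving and the failure is specific to the numpy plus sortByKey() combination.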
>
> Can anyone else reproduce this issue? Is it a bug? I don't see it if I
> leave the cluster on the default Python 2.6.8.
>
> Installing numpy on the slave via pssh and pip2.7 (so that it's identical
> to the master) does not fix the issue. Dunno if installing Python packages
> everywhere is even necessary though.
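
One way to narrow this down might be to check which interpreter each side is actually running and whether numpy imports there. Below is a minimal sketch, assuming the workers may be picking up a different Python than the driver. `probe_environment` is a hypothetical helper, not part of PySpark. As far as I know the worker interpreter is selected through the PYSPARK_PYTHON environment variable, but treat that as an assumption to verify on your setup.

```python
import sys


def probe_environment(_=None):
    """Report the running interpreter and whether numpy can be imported."""
    try:
        import numpy
        numpy_version = numpy.__version__
    except ImportError:
        numpy_version = None
    return {
        "executable": sys.executable,
        "python": "%d.%d.%d" % sys.version_info[:3],
        "numpy": numpy_version,
    }


# Driver side: shows what the shell itself is running.
print(probe_environment())

# Worker side, from the pyspark shell where `sc` already exists:
# sc.parallelize(range(4), 4).map(probe_environment).collect()
```

If the driver reports 2.7 with numpy while the workers report 2.6 or `numpy: None`, that mismatch would be worth including in the bug report.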
>
> Nick
>
>
