I saw the bug fix. I am using the latest Spark available on AWS EMR, which I think is 2.0.1. I am at work and can't check my home config. I don't think AWS has merged in this fix.


On Tue, Apr 4, 2017 at 4:42 PM, Jeff Zhang <zjffdu@gmail.com> wrote:

It is fixed in https://issues.apache.org/jira/browse/SPARK-13330

Holden Karau <holden@pigscanfly.ca> wrote on Wednesday, April 5, 2017 at 12:03 AM:
Which version of Spark is this (or is it a dev build)? We've recently made some improvements with PYTHONHASHSEED propagation.

On Tue, Apr 4, 2017 at 7:49 AM Eike von Seggern <eike.seggern@sevenval.com> wrote:
2017-04-01 21:54 GMT+02:00 Paul Tremblay <paulhtremblay@gmail.com>:
When I try to do a groupByKey() in my Spark environment, I get the error described here:


In order to attempt to fix the problem, I set up my ipython environment with the additional line:


When I fire up my ipython shell, and do:

In [7]: hash("foo")
Out[7]: -2457967226571033580

In [8]: hash("foo")
Out[8]: -2457967226571033580

So my hash function is now seeded and returns consistent values. But when I do a groupByKey(), I get the same error:

Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED

Anyone know how to fix this problem in python 3.4?
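(A hedged aside: the in-shell check above only shows consistency within one interpreter, which Python 3 always guarantees. The situation Spark's workers are in is separate interpreter processes, so a closer sanity check is to spawn fresh interpreters with PYTHONHASHSEED set and compare. A minimal sketch; hash_in_subprocess is just an illustrative helper name:)

```python
import os
import subprocess
import sys

def hash_in_subprocess(seed):
    """Compute hash('foo') in a fresh interpreter started with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash('foo'))"], env=env
    )
    return int(out)

# With the same seed, two separate interpreter processes agree,
# which is what Spark needs across its workers:
print(hash_in_subprocess(0) == hash_in_subprocess(0))  # True
```

If the processes disagree when you expected them to match, the environment variable is not reaching them, which is exactly the groupByKey() failure mode.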

Independent of the Python version, you have to ensure that Python on the Spark master and workers is started with PYTHONHASHSEED set, e.g. by adding it to the environment of the Spark processes.
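A minimal sketch of that setup, assuming a standard Spark layout (conf/spark-env.sh sourced on each node; spark.executorEnv.* is Spark's mechanism for passing environment variables to executors; the seed value 0 and job name are placeholders):

```shell
# Option 1: set the seed in conf/spark-env.sh on EVERY node (master and workers),
# so all Python processes Spark starts inherit it:
export PYTHONHASHSEED=0

# Option 2: set it per application via spark.executorEnv, e.g. at submit time:
# spark-submit --conf spark.executorEnv.PYTHONHASHSEED=0 my_job.py
```

Setting it only in the shell that launches the driver (or in an IPython profile) is not enough, because the executor JVMs start their own Python workers with their own environment.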



Paul Henry Tremblay
Robert Half Technology