Hi All,

I'm experiencing a java.lang.NegativeArraySizeException in a PySpark script of mine. I've pasted the full traceback at the end of this email.

I have isolated the line of code in my script which "causes" the exception to occur. Although the exception seems to occur deterministically, it is very unclear why the different variants of that line would trigger it. Unfortunately, I am only able to reproduce the bug in the context of a large data processing job, and the line of code which must change to reproduce the bug has little meaning out of context. The bug occurs when I call "map" on an RDD with a function that references some state outside of the RDD (which is presumably serialized and distributed with the function). The output of the function is a tuple whose first element is an int and whose second element is a list of floats (the same positive length every time, as verified by an 'assert' statement).
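For reference, here is a minimal sketch of the pattern I'm describing; the names and values are hypothetical stand-ins, not my actual code:

```python
# Hypothetical driver-side state captured by the closure and shipped
# to the workers along with the mapper function.
external_state = {"bias": 0.5}

def predict(record):
    # record is assumed to be (id, feature_list); both names are made up here.
    label = int(record[0])
    scores = [float(x) + external_state["bias"] for x in record[1]]
    assert len(scores) > 0  # same positive length every time
    return (label, scores)

# In the real job this is roughly:
#   prediction_rdd = input_rdd.map(predict)
#   predictions = np.array(prediction_rdd.collect())
```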

Given that:
- it's unclear why changes to the line would cause an exception,
- the exception comes from within PySpark code, and
- the exception involves a negative array size (and I couldn't have created a negative-sized array anywhere in my Python code),
I suspect this is a bug in PySpark.

Has anybody else observed or reported this bug?  


Traceback (most recent call last):
  File "/home/bmiller1/pipeline/driver.py", line 214, in <module>
  File "/home/bmiller1/pipeline/driver.py", line 203, in main
  File "/home/bmiller1/pipeline/layer/svm_layer.py", line 137, in write_results
    fig, accuracy = _get_results(self.prediction_rdd)
  File "/home/bmiller1/pipeline/layer/svm_layer.py", line 56, in _get_results
    predictions = np.array(prediction_rdd.collect())
  File "/home/spark/spark-1.1.0-bin-hadoop1/python/pyspark/rdd.py", line 723, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/home/spark/spark-1.1.0-bin-hadoop1/python/pyspark/rdd.py", line 2026, in _jrdd
    broadcast_vars, self.ctx._javaAccumulator)
  File "/home/spark/spark-1.1.0-bin-hadoop1/python/lib/py4j-", line 701, in __call__
  File "/home/spark/spark-1.1.0-bin-hadoop1/python/lib/py4j-", line 304, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.python.PythonRDD. Trace:
at py4j.Base64.decode(Base64.java:292)
at py4j.Protocol.getBytes(Protocol.java:167)
at py4j.Protocol.getObject(Protocol.java:276)
at py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:81)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:66)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:701)