We were able to reproduce it with a minimal example. I've opened a JIRA issue:

https://issues.apache.org/jira/browse/SPARK-15825

On Wed, Jun 8, 2016 at 12:43 PM, Koert Kuipers <koert@tresata.com> wrote:
Great!

We weren't able to reproduce it at first because the unit tests use a broadcast join, while on the cluster the same query uses a sort-merge join. So the issue is in the sort-merge join.

We are now able to reproduce it in tests by setting spark.sql.autoBroadcastJoinThreshold=-1.
It produces weird-looking, garbled results in the join.
Hoping to get a minimal reproducible example soon.
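
As a minimal sketch, this is how we force the sort-merge path in a local test session (the master and app-name values here are placeholders; only the threshold setting matters):

    import org.apache.spark.sql.SparkSession

    // Disable broadcast joins so the planner falls back to sort-merge join,
    // matching what happens on the cluster with large inputs.
    val spark = SparkSession.builder()
      .master("local[2]")                                     // placeholder test master
      .appName("smj-repro")                                   // placeholder app name
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")   // -1 disables broadcast joins
      .getOrCreate()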

On Wed, Jun 8, 2016 at 10:24 AM, Pete Robbins <robbinspg@gmail.com> wrote:
I just raised https://issues.apache.org/jira/browse/SPARK-15822 for a similar-looking issue. Analyzing the core dump from the SIGSEGV with Memory Analyzer, it looks very much like a UTF8String is badly corrupted.

Cheers,


On Fri, 27 May 2016 at 21:00 Koert Kuipers <koert@tresata.com> wrote:
Hello all,
After getting our unit tests to pass on Spark 2.0.0-SNAPSHOT, we are now trying to run some algorithms at scale on our cluster.
Unfortunately this means that when I see errors, I have a harder time boiling them down to a small reproducible example.

Today we are running an iterative algorithm using the Dataset API, and we are seeing tasks fail with errors that seem to be related to unsafe operations. The same tasks succeed without issues in our unit tests.

I see either:

16/05/27 12:54:46 ERROR executor.Executor: Exception in task 31.0 in stage 21.0 (TID 1073)
java.lang.NegativeArraySizeException
        at org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
        at org.apache.spark.unsafe.types.UTF8String.toString(UTF8String.java:821)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:359)
        at org.apache.spark.sql.execution.aggregate.SortBasedAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortBasedAggregateExec.scala:74)
        at org.apache.spark.sql.execution.aggregate.SortBasedAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortBasedAggregateExec.scala:71)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:775)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:775)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

or, alternatively:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fe571041cba, pid=2450, tid=140622965913344
#
# JRE version: Java(TM) SE Runtime Environment (7.0_75-b13) (build 1.7.0_75-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.75-b04 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy

I assume the best thing would be to get Spark to print out the generated code that is causing this.
What switch do I need to use again to do that?
Thanks,
koert
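
PS: If I remember right, one way to dump the generated code is the debug helpers (a sketch, assuming the debug API in this 2.0 snapshot matches what I've used before; myDataset is a placeholder for the Dataset in question):

    import org.apache.spark.sql.execution.debug._

    // Prints the generated Java source for each whole-stage-codegen
    // subtree of this Dataset's physical plan.
    myDataset.debugCodegen()

Setting spark.sql.codegen.wholeStage=false should also disable whole-stage code generation entirely, which would help confirm whether the generated code is at fault.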