I’m getting seemingly invalid results when I collect an RDD. This happens on both Spark 1.1.0 and 1.2.0, using Java 8 on Mac OS X.

See the following code snippet:

JavaRDD<Thing> rdd = pairRDD.values();
rdd.foreach(e -> System.out.println("RDD Foreach: " + e));
rdd.collect().forEach(e -> System.out.println("Collected Foreach: " + e));

I would expect the two outputs to be identical, but instead I see:

RDD Foreach: Thing1
RDD Foreach: Thing2
RDD Foreach: Thing3
RDD Foreach: Thing4
Collected Foreach: Thing1
Collected Foreach: Thing1
Collected Foreach: Thing1
Collected Foreach: Thing2

So essentially all but one of the valid entries are replaced by an equal number of duplicates of the remaining values. I’ve tried various map and filter operations, and the contents of the RDD always appear correct until I call collect(). I’ve also found that calling cache() on the RDD materialises the duplication, so that the RDD foreach then shows the duplicates too.
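One hypothesis I’ve been trying to rule out is that the elements are backed by a single mutable instance that gets reused while the partition is read (the way Hadoop record readers reuse Writable objects), so that foreach sees each value in turn but collect() ends up holding several references to the same object. The sketch below is plain Java, not Spark; Thing and the iterator are hypothetical stand-ins for my actual class and data source, just to show how that reuse would reproduce the symptom:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReuseDemo {
    // Hypothetical mutable record, standing in for my Thing class.
    static class Thing {
        String name;
        Thing(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    // An iterator that mutates and hands back one shared instance,
    // the way Hadoop RecordReaders reuse Writable objects.
    static Iterator<Thing> reusingIterator(String... names) {
        final Thing shared = new Thing("");
        final Iterator<String> it = Arrays.asList(names).iterator();
        return new Iterator<Thing>() {
            public boolean hasNext() { return it.hasNext(); }
            public Thing next() { shared.name = it.next(); return shared; }
        };
    }

    public static void main(String[] args) {
        Iterator<Thing> it = reusingIterator("Thing1", "Thing2", "Thing3");

        // Printing inside the loop (analogous to rdd.foreach) sees each
        // value correctly, because we print before the next mutation...
        List<Thing> collected = new ArrayList<>();
        while (it.hasNext()) {
            Thing t = it.next();
            System.out.println("During iteration: " + t);
            collected.add(t); // ...but this stores the same reference each time.
        }

        // Every collected reference now shows the last value written,
        // analogous to the duplicates I see after collect().
        for (Thing t : collected) {
            System.out.println("Collected: " + t);
        }
    }
}
```

In this toy version every collected element prints the last value; in my case the duplicated values would presumably vary per partition, which might explain seeing mostly Thing1 plus one Thing2. I haven’t confirmed this is what Spark is doing, which is partly why I’m asking.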

Any suggestions for how I can go about debugging this would be massively appreciated.