spark-user mailing list archives

From Tristan Blakers <tris...@blackfrog.org>
Subject Re: Incorrect results when calling collect() ?
Date Thu, 18 Dec 2014 10:42:29 GMT
I suspected the same thing, but because the underlying data classes are
deserialised by Avro, I think they have to be mutable: you need to provide
a no-args constructor and settable fields.

Nothing is being cached anywhere in my code, and this can be reproduced
using data directly out of the newAPIHadoopRDD() call. Debug statements
added to the constructors of the various classes show that the right number
of objects are being constructed, though the watches set on some of the
fields don’t always trigger, so I suspect the serialisation may be doing
something a bit too clever.

Tristan
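
The symptom in this thread (foreach() prints distinct values, collect() returns duplicates) matches the object reuse Sean describes below. The Spark API docs for newAPIHadoopRDD() warn that Hadoop's RecordReader re-uses the same object for each record, so buffering the records without copying them yields many references to one instance. The following is a minimal sketch of that effect in plain Java, with no Spark involved. The Thing class, its copy constructor, and reusingReader() are hypothetical stand-ins, not the poster's actual Avro classes or any Hadoop API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class ReuseDemo {
    /** Hypothetical mutable record, standing in for an Avro-deserialised class. */
    static final class Thing {
        private String name;
        Thing() {}
        Thing(Thing other) { this.name = other.name; } // copy constructor
        void setName(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    /** Hypothetical reader that refills ONE shared instance for every record. */
    static Iterator<Thing> reusingReader(final int count) {
        final Thing shared = new Thing();
        return new Iterator<Thing>() {
            private int index = 1;
            @Override public boolean hasNext() { return index <= count; }
            @Override public Thing next() {
                if (!hasNext()) throw new NoSuchElementException();
                shared.setName("Thing" + index++);
                return shared; // same object every time, new contents
            }
        };
    }

    /** Like rdd.foreach(): each record is consumed before the next one is read. */
    static List<String> streamed(int count) {
        List<String> seen = new ArrayList<>();
        for (Iterator<Thing> it = reusingReader(count); it.hasNext(); ) {
            seen.add(it.next().toString());
        }
        return seen;
    }

    /** Like collect() or cache() without a copy: references are buffered, then read. */
    static List<String> bufferedWithoutCopy(int count) {
        List<Thing> buffer = new ArrayList<>();
        for (Iterator<Thing> it = reusingReader(count); it.hasNext(); ) {
            buffer.add(it.next());
        }
        List<String> seen = new ArrayList<>();
        for (Thing t : buffer) seen.add(t.toString());
        return seen;
    }

    /** Copying each record before buffering keeps the values distinct. */
    static List<String> bufferedWithCopy(int count) {
        List<Thing> buffer = new ArrayList<>();
        for (Iterator<Thing> it = reusingReader(count); it.hasNext(); ) {
            buffer.add(new Thing(it.next()));
        }
        List<String> seen = new ArrayList<>();
        for (Thing t : buffer) seen.add(t.toString());
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(streamed(3));            // [Thing1, Thing2, Thing3]
        System.out.println(bufferedWithoutCopy(3)); // [Thing3, Thing3, Thing3]
        System.out.println(bufferedWithCopy(3));    // [Thing1, Thing2, Thing3]
    }
}
```

In the Spark job itself, the analogous change would be a map() that copies each record immediately after newAPIHadoopRDD() and before any cache() or collect(). If Thing is an Avro-generated specific record, Thing.newBuilder(e).build() is one possible way to make that copy; whether it applies here depends on how the classes were generated.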

On 18 December 2014 at 21:25, Sean Owen <sowen@cloudera.com> wrote:
>
> It sounds a lot like your values are mutable classes and you are
> mutating or reusing them somewhere? It might work until you actually
> try to materialize them all and find many point to the same object.
>
> On Thu, Dec 18, 2014 at 10:06 AM, Tristan Blakers <tristan@blackfrog.org> wrote:
> > Hi,
> >
> > I’m getting some seemingly invalid results when I collect an RDD. This is
> > happening in both Spark 1.1.0 and 1.2.0, using Java8 on Mac.
> >
> > See the following code snippet:
> >
> > JavaRDD<Thing> rdd = pairRDD.values();
> > rdd.foreach(e -> System.out.println("RDD Foreach: " + e));
> > rdd.collect().forEach(e -> System.out.println("Collected Foreach: " + e));
> >
> > I would expect the results from the two outputters to be identical, but
> > instead I see:
> >
> > RDD Foreach: Thing1
> > RDD Foreach: Thing2
> > RDD Foreach: Thing3
> > RDD Foreach: Thing4
> > (…snip…)
> > Collected Foreach: Thing1
> > Collected Foreach: Thing1
> > Collected Foreach: Thing1
> > Collected Foreach: Thing2
> >
> > So essentially the valid entries except for one are replaced by an
> > equivalent number of duplicate objects. I’ve tried various map and filter
> > operations, but the results in the RDD always appear correct until I try
> > to collect() the results. I’ve also found that calling cache() on the RDD
> > materialises the duplication process such that the RDD Foreach displays
> > the duplicates too...
> >
> > Any suggestions for how I can go about debugging this would be massively
> > appreciated.
> >
> > Cheers
> > Tristan
>
