spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andras Nemeth <andras.nem...@lynxanalytics.com>
Subject Re: Spark unit testing best practices
Date Fri, 16 May 2014 11:06:17 GMT
Thanks for the answers!

On a concrete example, here is what I did to test my (wrong :) ) hypothesis
before writing my email:
class SomethingNotSerializable {
  def process(a: Int): Int = 2 *a
}
object NonSerializableClosure extends App {
  val sc = new spark.SparkContext(
      "local",
      "SerTest",
      "/home/xandrew/spark-0.9.0-incubating",
      Seq("target/scala-2.10/sparktests_2.10-0.1-SNAPSHOT.jar"))
  val sns = new SomethingNotSerializable
  println(sc.parallelize(Seq(1,2,3))
            .map(sns.process(_))
            .reduce(_ + _))
}

This program prints 12 correctly. If I change "local" to point to my spark
master the code fails on the worker with a NullPointerException in the line
".map(sns.process(_))".

But I have to say that my original assumption that this is a serialization
issue was wrong, as adding extends Serializable to my class does _not_
solve the problem in non-local mode. This seems to be something more
convoluted, the sns reference in my closure is probably not stored by
value, instead I guess it's a by name reference to
"NonSerializableClosure.sns". I'm a bit surprised why this results in a
NullPointerException instead of some error when trying to run the
constructor of this object on the worker. Maybe something to do with the
magic of App.

Anyways, while this is indeed an example of an error that doesn't manifest
in local mode, I guess it turns out to be convoluted enough that we won't
worry about it for now, use local in tests, and I'll ask again if we see
some actual prod vs unittest problems.


On using local-cluster, this does sound like exactly what I had in mind.
But it doesn't seem to work for application developers. It seems to assume
you are running within a spark build (it fails while looking for the file
bin/compute-classpath.sh). So maybe that's a reason it's not documented...

Cheers,
Andras





On Wed, May 14, 2014 at 7:58 PM, Mark Hamstra <mark@clearstorydata.com>wrote:

> Local mode does serDe, so it should expose serialization problems.
>
>
> On Wed, May 14, 2014 at 10:53 AM, Philip Ogren <philip.ogren@oracle.com>wrote:
>
>> Have you actually found this to be true?  I have found Spark local mode
>> to be quite good about blowing up if there is something non-serializable
>> and so my unit tests have been great for detecting this.  I have never seen
>> something that worked in local mode that didn't work on the cluster because
>> of different serialization requirements between the two.  Perhaps it is
>> different when using Kryo....
>>
>>
>>
>> On 05/14/2014 04:34 AM, Andras Nemeth wrote:
>>
>>> E.g. if I accidentally use a closure which has something
>>> non-serializable in it, then my test will happily succeed in local mode but
>>> go down in flames on a real cluster.
>>>
>>
>>
>

Mime
View raw message