spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: Spark unit test question
Date Tue, 22 Oct 2013 01:21:42 GMT
Yup, local mode also catches serialization errors. The issue with local variables in the function arises only if they're not Serializable, and even then Spark's closure cleaner tries to eliminate references to them in some cases. But here's an example of one thing that wouldn't work:

class C {
  val f = new java.io.FileInputStream("f") // not Serializable

  val x = 1 // Serializable, but also part of C

  // The map closure accesses this.x, which will pull in the whole C object
  def doStuff(rdd: RDD[Int]) = rdd.map(_ + x)
}

new C().doStuff(rdd)
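One common way around that (sketched here against the same hypothetical class, not taken from the original message) is to copy the field into a local val inside the method, so the closure captures just that value instead of the whole C instance:

class C2 {
  val f = new java.io.FileInputStream("f") // still not Serializable

  val x = 1

  def doStuff(rdd: RDD[Int]) = {
    val localX = x // local copy; the closure captures only this Int
    rdd.map(_ + localX)
  }
}

new C2().doStuff(rdd) // works: the C2 instance itself never gets serialized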

Matei

On Oct 21, 2013, at 1:29 PM, Josh Rosen <rosenville@gmail.com> wrote:

> I think that the regular 'local' mode will work for testing serialization; it serializes both tasks and results in order to catch serialization errors:
> 
> https://github.com/apache/incubator-spark/blob/v0.8.0-incubating/core/src/main/scala/org/apache/spark/scheduler/local/LocalScheduler.scala#L187
> https://github.com/apache/incubator-spark/blob/v0.8.0-incubating/core/src/main/scala/org/apache/spark/scheduler/local/LocalScheduler.scala#L200
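For reference, here's a minimal sketch (not from the original thread; class and file names are illustrative) of a test that should trip that check even in plain local mode:

import java.io.FileInputStream
import org.apache.spark.SparkContext

class Holder {
  val stream = new FileInputStream("/dev/null") // not Serializable
  // The closure references this.stream, so the whole Holder would have to be serialized
  def run(sc: SparkContext) =
    sc.parallelize(1 to 10).map(_ + stream.available()).collect()
}

val sc = new SparkContext("local[2]", "serialization-test")
try {
  new Holder().run(sc) // expected to fail with a serialization error
} finally {
  sc.stop()
}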
> 
> 
> On Mon, Oct 21, 2013 at 1:06 PM, Aaron Davidson <ilikerps@gmail.com> wrote:
> To answer your second question first, you can use the SparkContext master string "local-cluster[2, 1, 512]" (instead of "local[2]"), which creates a local test cluster with 2 workers, each with 1 core and 512 MB of memory. This should allow you to accurately test things like serialization.
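For reference, a rough sketch of what that looks like in a test (assuming a Spark build is available locally, since local-cluster mode launches separate executor JVMs):

import org.apache.spark.SparkContext

// 2 workers, 1 core each, 512 MB each; tasks and results are actually
// serialized between JVMs, much like on a real cluster.
val sc = new SparkContext("local-cluster[2, 1, 512]", "cluster-like-test")
try {
  val result = sc.parallelize(1 to 100).map(_ * 2).collect()
  assert(result.length == 100)
} finally {
  sc.stop()
}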
> 
> I don't believe that adding a function-local variable would cause the function to be unserializable, though. The only concern when shipping around functions is when they refer to variables outside the function's scope, in which case Spark will automatically ship those variables to all workers (unless you override this behavior with a broadcast or accumulator variable).
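A minimal sketch of the broadcast alternative mentioned there (the lookup map is purely illustrative):

// Ship the lookup table to each worker once via a broadcast, rather than
// capturing it in every task's closure.
val lookup = Map(1 -> "one", 2 -> "two")
val lookupBc = sc.broadcast(lookup)
val named = sc.parallelize(1 to 3).map(i => lookupBc.value.getOrElse(i, "?")).collect()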
> 
> 
> On Mon, Oct 21, 2013 at 10:30 AM, Shay Seng <shay@1618labs.com> wrote:
> I'm trying to write a unit test to ensure that some functions I rely on will always serialize and run correctly on a cluster.
> In one of these functions I've deliberately added a "val x: Int = 1", which should prevent this method from being serializable, right?
> 
> In the test I've done:
>    sc = new SparkContext("local[2]","test")
>    ...
>    val pdata = sc.parallelize(data)
>    val c = pdata.map().collect()
> 
> The unit tests still complete with no errors; I'm guessing that's because Spark knows local[2] doesn't require serialization? Is there some way I can force Spark to run like it would on a real cluster?
> 
> 
> tks
> shay
> 
> 

