spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: closure serialization behavior driving me crazy
Date Tue, 11 Nov 2014 05:26:43 GMT
Hey Sandy,

Try using the -Dsun.io.serialization.extendedDebugInfo=true flag on the JVM to print the contents
of the objects being serialized. Another thing that helps is to do the following:

{
  val _arr = arr
  models.map(... _arr ...)
}

Basically, copy the global variable into a local one. The field access from outside (through the
interpreter-generated object that contains the line initializing arr) is then no longer needed, so
the closure no longer holds a reference to that object and doesn't drag it in when it's serialized.
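
For instance (just a sketch; data here stands in for whatever RDD the transformation runs on, and
the body of the map is made up, so only the local-copy pattern matters):

  val result = {
    val _arr = arr                      // local copy; the closure captures only this
    data.map { row =>
      // Reference _arr, never arr, so the interpreter-generated wrapper object
      // that holds arr isn't pulled into the serialized closure.
      _arr.length
    }
  }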

I'm really confused as to why the Array.ofDim version would be so large, by the way, but are you
sure you haven't flipped the dimensions somewhere (e.g. 5 x 1867 instead of 1867 x 5)? An array of
5 doubles will consume more than 5*8 bytes (probably something like 60 at least), and the outer
array still holds a pointer to each one, so I'd expect 1867 of them to take somewhat more than
80 KB (which is already very close to 1867*5*8).
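
To put rough numbers on it (back-of-the-envelope only; the per-object overheads below are
assumptions and vary by JVM and settings):

  // Rough memory estimate for a 1867 x 5 Array[Array[Double]], assuming about
  // 16 bytes of array header and 8 bytes per double / per reference.
  val rows = 1867
  val cols = 5
  val rawData  = rows * cols * 8          // 74,680 bytes of doubles
  val rowCost  = rows * (16 + cols * 8)   // ~105 KB once each row has its own header
  val outerPtr = 16 + rows * 8            // the outer array of row references
  println(rawData, rowCost + outerPtr)    // roughly 75 KB vs. 120 KB

Either way that's on the order of 100 KB in memory, nowhere near 400 MB, so the giant closure must
be dragging in a lot more than the array itself.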

Matei

> On Nov 10, 2014, at 1:01 AM, Sandy Ryza <sandy.ryza@cloudera.com> wrote:
> 
> I'm experiencing some strange behavior with closure serialization that is totally mind-boggling
> to me.  It appears that two arrays of equal size take up vastly different amounts of space
> inside closures if they're generated in different ways.
> 
> The basic flow of my app is to run a bunch of tiny regressions using Commons Math's
> OLSMultipleLinearRegression and then reference a 2D array of the results from a transformation.
> I was running into OOMEs and NotSerializableExceptions and tried to get closer to the root issue
> by calling the closure serializer directly.
>   scala> val arr = models.map(_.estimateRegressionParameters()).toArray
> 
> The result array is 1867 x 5.  Serialized, it's about 80K bytes, which seems about right:
>   scala> SparkEnv.get.closureSerializer.newInstance().serialize(arr)
>   res17: java.nio.ByteBuffer = java.nio.HeapByteBuffer[pos=0 lim=80027 cap=80027]
> 
> If I reference it from a simple function:
>   scala> def func(x: Long) = arr.length
>   scala> SparkEnv.get.closureSerializer.newInstance().serialize(func)
> I get a NotSerializableException.
> 
> If I take pains to create the array using a loop:
>   scala> val arr = Array.ofDim[Double](1867, 5)
>   scala> for (s <- 0 until models.length) {
>   | arr(s) = models(s).estimateRegressionParameters()
>   | }
> Serialization works, but the serialized closure for the function is a whopping 400MB.
> 
> If I pass in an array of the same length that was created in a different way, the size
> of the serialized closure is only about 90K, which seems about right.
> 
> Naively, it seems like somehow the history of how the array was created is having an
> effect on what happens to it inside a closure.
> 
> Is this expected behavior?  Can anybody explain what's going on?
> 
> Any insight very much appreciated,
> Sandy



