spark-user mailing list archives

From Sandy Ryza <sandy.r...@cloudera.com>
Subject Re: closure serialization behavior driving me crazy
Date Tue, 11 Nov 2014 21:30:24 GMT
I tried turning on the extended debug info.  The Scala output is a little
opaque (lots of lines like '- field (class "$iwC$$iwC$$iwC$$iwC$$iwC$$iwC",
name: "$iw", type: "class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC")'), but it
seems that, as expected, the full array of OLSMultipleLinearRegression
objects is somehow getting pulled in.

I'm not sure I understand your comment about Array.ofDim being large.
Serialized on its own, the array only takes up about 80K, which is close to
1867*5*sizeof(double).  The 400MB shows up only when the array is referenced
from a function, which pulls in all the extra data.

Copying the global variable into a local one seems to work.  Much
appreciated, Matei.
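
For the archives, here is the working pattern spelled out.  This is only a
sketch: "rdd" stands in for whatever the transformation actually runs over,
and the map body is illustrative:

  val arr = models.map(_.estimateRegressionParameters()).toArray

  // Copy the REPL-level val into a true local before building the closure.
  // The closure then captures only the ~80K array of doubles, not the
  // interpreter wrapper objects that also hold all the
  // OLSMultipleLinearRegression models.
  val result = {
    val _arr = arr
    rdd.map(x => _arr.length)
  }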


On Mon, Nov 10, 2014 at 9:26 PM, Matei Zaharia <matei.zaharia@gmail.com>
wrote:

> Hey Sandy,
>
> Try using the -Dsun.io.serialization.extendedDebugInfo=true flag on the
> JVM to print the contents of the objects. In addition, something else that
> helps is to do the following:
>
> {
>   val _arr = arr
>   models.map(... _arr ...)
> }
>
> Basically, copy the global variable into a local one. Then the field
> access from outside (from the interpreter-generated object that contains
> the line initializing arr) is no longer required, and the closure no
> longer holds a reference to that object.
>
> I'm really confused as to why Array.ofDim would be so large, by the way.
> Are you sure you haven't flipped the dimensions around (i.e. should it be
> 5 x 1867)? A 5-double array will consume more than 5*8 bytes (probably
> something like 60 at least), and an array of those will still hold a
> pointer to each one, so I'd expect that many of them to take more than
> 80 KB (which is very close to 1867*5*8 bytes).
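>
> Rough arithmetic for the in-memory layout, assuming a typical 64-bit JVM
> with compressed oops (16-byte array headers, 4-byte references; exact
> sizes vary by JVM):
>
>   val rowBytes   = 16 + 5 * 8            // double[5]: 56 bytes each
>   val rowsBytes  = 1867 * rowBytes       // ~102 KB for the inner arrays
>   val refsBytes  = 16 + 1867 * 4         // outer array of pointers, ~7.5 KB
>   val totalBytes = rowsBytes + refsBytes // ~110 KB vs 1867*5*8 = ~75 KB raw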
>
> Matei
>
> > On Nov 10, 2014, at 1:01 AM, Sandy Ryza <sandy.ryza@cloudera.com> wrote:
> >
> > I'm experiencing some strange behavior with closure serialization that
> is totally mind-boggling to me.  It appears that two arrays of equal size
> take up vastly different amounts of space inside closures if they're
> generated in different ways.
> >
> > The basic flow of my app is to run a bunch of tiny regressions using
> Commons Math's OLSMultipleLinearRegression and then reference a 2D array of
> the results from a transformation.  I was running into OOMEs and
> NotSerializableExceptions and tried to get closer to the root issue by
> calling the closure serializer directly.
> >   scala> val arr = models.map(_.estimateRegressionParameters()).toArray
> >
> > The result array is 1867 x 5.  Serialized on its own, it is about 80K
> bytes, which seems right:
> >   scala> SparkEnv.get.closureSerializer.newInstance().serialize(arr)
> >   res17: java.nio.ByteBuffer = java.nio.HeapByteBuffer[pos=0 lim=80027
> cap=80027]
> >
> > If I reference it from a simple function:
> >   scala> def func(x: Long) = arr.length
> >   scala> SparkEnv.get.closureSerializer.newInstance().serialize(func)
> > I get a NotSerializableException.
> >
> > If I take pains to create the array using a loop:
> >   scala> val arr = Array.ofDim[Double](1867, 5)
> >   scala> for (s <- 0 until models.length) {
> >   | arr(s) = models(s).estimateRegressionParameters()
> >   | }
> > Serialization works, but the serialized closure for the function is a
> whopping 400MB.
> >
> > If I pass in an array of the same length that was created in a different
> way, the size of the serialized closure is only about 90K, which seems
> about right.
> >
> > Naively, it seems like somehow the history of how the array was created
> is having an effect on what happens to it inside a closure.
> >
> > Is this expected behavior?  Can anybody explain what's going on?
> >
> > any insight very appreciated,
> > Sandy
>
>
