spark-user mailing list archives

From Stephen Boesch <java...@gmail.com>
Subject Re: Returned type of Broadcast variable is byte array
Date Thu, 30 Oct 2014 18:02:38 GMT
The byte array turns out to be a Java-serialized object stream (the output
of an ObjectOutputStream) that contains a Tuple2[ParallelCollectionRDD,Function2].
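
For anyone who wants to repeat this kind of inspection, here is a minimal sketch of how a mystery Array[Byte] can be checked and decoded with plain java.io. It assumes the payload was written with standard Java serialization. The object name InspectBytes, its helpers, and the placeholder strings in the tuple are all hypothetical and only stand in for the real payload; none of this is Spark API.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object InspectBytes {
  // Java-serialize an object into a byte array (stand-in for the mystery payload).
  def serialize(obj: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(obj) finally out.close()
    buffer.toByteArray
  }

  // Java serialization streams start with the magic bytes 0xAC 0xED.
  def looksJavaSerialized(bytes: Array[Byte]): Boolean =
    bytes.length >= 2 && bytes(0) == 0xAC.toByte && bytes(1) == 0xED.toByte

  // Read the object back so that its runtime class can be examined.
  def deserialize(bytes: Array[Byte]): AnyRef = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject() finally in.close()
  }

  def main(args: Array[String]): Unit = {
    // Placeholder tuple; the real payload held an RDD and a function.
    val bytes = serialize(("rdd placeholder", "function placeholder"))
    if (looksJavaSerialized(bytes)) {
      println(deserialize(bytes).getClass.getName)
    }
  }
}
```

Decoding a real payload this way would additionally need the classes it references (here the Spark classes) on the classpath.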

What then should be done differently in the broadcast code (which follows
the structure of an example taken from mllib)?

assert(crows.isInstanceOf[Array[MVector]])
val bcRows = sc.broadcast(crows)
..

  val arrayVect = bcRows.value
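
As a side note on why this compiles but fails at runtime: on the JVM a generic accessor is erased, so the cast to the declared element type only happens where the value is used. The sketch below illustrates that with a hypothetical Holder class. It is not Spark's Broadcast implementation, just a stand-in for any generic container whose stored payload does not match its declared type.

```scala
object ErasureDemo {
  // Hypothetical generic container: the payload is stored untyped
  // and cast back to T when read.
  final class Holder[T](payload: Any) {
    // T is erased at runtime, so this cast is unchecked and never fails here.
    def value: T = payload.asInstanceOf[T]
  }

  // Returns true if using the holder's value as Array[String] throws.
  def failsAtUseSite(holder: Holder[Array[String]]): Boolean =
    try {
      // The compiler inserts the real cast to Array[String] at this call site.
      val n = holder.value.length
      n < 0
    } catch {
      case _: ClassCastException => true
    }

  def main(args: Array[String]): Unit = {
    // Declared as Array[String], but actually holding bytes.
    val wrong = new Holder[Array[String]](Array[Byte](1, 2, 3))
    println(failsAtUseSite(wrong))
    // JVM names of the two array classes, as they appear in such exceptions.
    println(classOf[Array[Byte]].getName)
    println(classOf[Array[String]].getName)
  }
}
```

In other words, the static type only records what the code was declared to hold; if something else ends up stored under that handle, the mismatch surfaces as a ClassCastException at the first typed use.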



2014-10-30 7:42 GMT-07:00 Stephen Boesch <javadba@gmail.com>:

>
> As a template for creating a broadcast variable, the following code
> snippet within mllib was used:
>
>     val bcIdf = dataset.context.broadcast(idf)
>     dataset.mapPartitions { iter =>
>       val thisIdf = bcIdf.value
>
>
> The new code follows that model:
>
> import org.apache.spark.mllib.linalg.{Vector => MVector}
>   ..
>     assert(crows.isInstanceOf[Array[MVector]])
>     val bcRows = sc.broadcast(crows)
>     val GU = mat.rows.zipWithIndex.mapPartitions { case dataIter =>
>         // bcRows.value is seen in the debugger to be of type Array[Byte] .. ??
>         val arrayVect = bcRows.value
>
> That last line is unhappy:
>
>    java.lang.ClassCastException: [B cannot be cast to [Lorg.apache.spark.mllib.linalg.Vector;
>
> So the compiler is aware that the return type of the broadcast "value"
> method should be an array of vectors, as expected. However, the actual
> runtime type is Array[Byte]. Any insights on this?
>
>
