spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aureliano Buendia <buendia...@gmail.com>
Subject Why does saveAfObjectFile() serialize Array[T] instead of T?
Date Sun, 05 Jan 2014 19:52:30 GMT
Hi,

Given an RDD[T] instance, saveAfObjectFile() passes an instance of Array[T]
to serialize(), and not and instance of T:

  def saveAsObjectFile(path: String) {
    this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
      .map(*x* => (NullWritable.get(), new BytesWritable(
*Utils.serialize(x)*)))
      .saveAsSequenceFile(path)
  }

Is this array mapping for efficiency reasons, or are there other reasons
for this?

I'm trying to use saveAfObjectFile() to serialize protobuf messages.
Protobuf messages already come with a method that turns them into
Array[Byte] (see
here<https://github.com/osmandapp/Osmand/blob/master/OsmAnd-java/src/com/google/protobuf/AbstractMessageLite.java#L60>),
that is, toByteArray() can be clled on an instance of T. Considering this,
how can a protobuf message instance be serialized in a custom version of
saveAsObjectFile()?

Is it a good idea to drop array mapping?:

def saveAsObjectFile(path: String) {
    this.map(x => (NullWritable.get(), new BytesWritable(*x.toByteArray()*
)))
      .saveAsSequenceFile(path)
  }

Mime
View raw message