spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: RDD.SaveAsObject will always use Java Serialization?
Date Thu, 12 Sep 2013 02:12:42 GMT
Hi Wenlei,

This was actually semi-intentional because we wanted a forward-compatible format across Spark
versions. I'm not sure whether that was a good idea (and we didn't promise it will be compatible),
so later we can change it. But for now, if you'd like to use Kryo, I recommend implementing
the same thing that saveAsObjectFile does by hand, on top of Hadoop SequenceFiles. This is
all it does:

  /**
   * Save this RDD as a SequenceFile of serialized objects.
   */
  def saveAsObjectFile(path: String) {
    this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
      .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
      .saveAsSequenceFile(path)
  }

With Kryo you might want to use mapPartitions instead of that second map in order to reuse
the serializer across records. This code up here uses arrays to reduce the cost of writing
the Java class information for each record.

Matei

On Sep 10, 2013, at 10:09 PM, Wenlei Xie <wenlei.xie@gmail.com> wrote:

> Hi,
> 
> It looks like RDD.SaveAsObject will always use Java Serialization , even I set property
spark.serializer as spark.KryoSerializer ?
> 
> Thank you!
> 
> Best,
> Wenlei


Mime
View raw message