spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wenlei Xie <wenlei....@gmail.com>
Subject Re: RDD.SaveAsObject will always use Java Serialization?
Date Thu, 12 Sep 2013 04:36:37 GMT
I see. Thank you!


On Wed, Sep 11, 2013 at 7:12 PM, Matei Zaharia <matei.zaharia@gmail.com>wrote:

> Hi Wenlei,
>
> This was actually semi-intentional because we wanted a forward-compatible
> format across Spark versions. I'm not sure whether that was a good idea
> (and we didn't promise it will be compatible), so later we can change it.
> But for now, if you'd like to use Kryo, I recommend implementing the same
> thing that saveAsObjectFile does by hand, on top of Hadoop SequenceFiles.
> This is all it does:
>
>   /**
>    * Save this RDD as a SequenceFile of serialized objects.
>    */
>   def saveAsObjectFile(path: String) {
>     this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
>       .map(x => (NullWritable.get(), new
> BytesWritable(Utils.serialize(x))))
>       .saveAsSequenceFile(path)
>   }
>
> With Kryo you might want to use mapPartitions instead of that second map
> in order to reuse the serializer across records. This code up here uses
> arrays to reduce the cost of writing the Java class information for each
> record.
>
> Matei
>
> On Sep 10, 2013, at 10:09 PM, Wenlei Xie <wenlei.xie@gmail.com> wrote:
>
> > Hi,
> >
> > It looks like RDD.SaveAsObject will always use Java Serialization , even
> I set property spark.serializer as spark.KryoSerializer ?
> >
> > Thank you!
> >
> > Best,
> > Wenlei
>
>


-- 
Wenlei Xie (谢文磊)

Department of Computer Science
5132 Upson Hall, Cornell University
Ithaca, NY 14853, USA
Phone: (607) 255-5577
Email: wenlei.xie@gmail.com

Mime
View raw message