spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: KRYO usage details: Need Help
Date Wed, 22 Jan 2014 22:12:07 GMT
You should register each class you plan to use within your RDDs. In your case you only have
an RDD of Strings, so you don’t even really need a registrator (strings are registered by
default). But if you made custom objects you would use one.
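
For example, here's a minimal sketch with a made-up custom class (Point and PointRegistrator are hypothetical names, just for illustration):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

case class Point(x: Double, y: Double)

class PointRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // Register each custom class that appears in your RDDs so Kryo can
    // write a compact class ID instead of the fully qualified class name.
    kryo.register(classOf[Point])
  }
}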

To speed up Kryo you can also call kryo.setReferences(false) in your registrator, or set
spark.kryo.referenceTracking = false. This disables tracking of circular references. But in
general, when benchmarking on this small an amount of data, you'll probably see noise from
the JVM starting up.
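
For reference, a sketch of both options (reusing the MyRegistrator from your code below):

import com.esotericsoftware.kryo.Kryo
import org.apache.hadoop.io.Text
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.setReferences(false)  // stop tracking circular references
    kryo.register(classOf[Text])
  }
}

// or, equivalently, before creating the SparkContext:
System.setProperty("spark.kryo.referenceTracking", "false")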

Matei

On Jan 22, 2014, at 12:27 PM, suman bharadwaj <suman.dna@gmail.com> wrote:

> Hi,
> 
> I'm using the Spark code below. Currently I have a 25 MB file, and I'm trying to do a
> comparative study of Kryo and Java serialization.
> 
> I have a couple of questions:
> 
> 1. How do you know which classes to register in Kryo?
> 2. When the data is small, I'm seeing that Java serialization performs better than Kryo,
> so I was wondering whether the code below represents the correct usage of Kryo?
> 
> import org.apache.spark._
> import com.esotericsoftware.kryo.Kryo
> import org.apache.spark.serializer.KryoRegistrator
> import org.apache.hadoop.io.LongWritable
> import org.apache.hadoop.io.Text
> import org.apache.spark.storage.StorageLevel
> 
> class MyRegistrator extends KryoRegistrator {
>   override def registerClasses(kryo: Kryo) {
>     kryo.register(classOf[LongWritable])
>     kryo.register(classOf[Text])
>     kryo.register(classOf[Integer])
>     kryo.register(classOf[Array[String]])
>   }
> }
> 
> object HTest {
> 
>   def main(args: Array[String]) {
>     System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>     System.setProperty("spark.kryo.registrator", "MyRegistrator")
>     val sc = new SparkContext("local[4]", "Test")
>     val input = sc.textFile("/home/Test/DataSet/cd7a58dc-2053-4811-8463-b144781352ac_000004.csv").persist(StorageLevel.MEMORY_ONLY_SER)
>     // The first count reads the file and fills the serialized cache;
>     // the second count reads back from the cache.
>     println(input.count())
>     Thread.sleep(30000L)
>     println(input.count())
>     Thread.sleep(30000L)
>   }
> }
> 
> Your help is highly appreciated.
> 
> Regards,
> SB

