spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From java8964 <java8...@hotmail.com>
Subject Why I didn't see the benefits of using KryoSerializer
Date Tue, 17 Mar 2015 16:01:35 GMT
Hi, I am new to Spark. I tried to understand the memory benefits of using KryoSerializer.
I have this one box standalone test environment, which is 24 cores with 24G memory. I installed
Hadoop 2.2 plus Spark 1.2.0.
I put one text file in the hdfs about 1.2G.  Here is the settings in the spark-env.sh
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4"export SPARK_WORKER_MEMORY=32gexport
SPARK_DRIVER_MEMORY=2gexport SPARK_EXECUTOR_MEMORY=4g
First test case:val log=sc.textFile("hdfs://namenode:9000/test_1g/")log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)log.count()log.count()
The data is about 3M rows. For the first test case, from the storage in the web UI, I can
see "Size in Memory" is 1787M, and "Fraction Cached" is 70% with 7 cached partitions.This
matched with what I thought, and first count finished about 17s, and 2nd count finished about
6s.
2nd test case after restart the spark-shell:val log=sc.textFile("hdfs://namenode:9000/test_1g/")log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)log.count()log.count()
Now from the web UI, I can see "Size in Memory" is 1231M, and "Fraction Cached" is 100% with
10 cached partitions. It looks like caching the default "java serialized format" reduce the
memory usage, but coming with a cost that first count finished around 39s and 2nd count finished
around 9s. So the job runs slower, with less memory usage.
So far I can understand all what happened and the tradeoff.
Now the problem comes with when I tried to test with KryoSerializer
SPARK_JAVA_OPTS="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer" /opt/spark/bin/spark-shellval
log=sc.textFile("hdfs://namenode:9000/test_1g/")log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)log.count()log.count()
First, I saw that the new serializer setting passed in, as proven in the Spark Properties
of "Environment" shows "











spark.driver.extraJavaOptions


      -Dspark.serializer=org.apache.spark.serializer.KryoSerializer



". This is not there for first 2 test cases.But in the web UI of "Storage", the "Size in Memory"
is 1234M, with 100% "Fraction Cached" and 10 cached partitions. The first count took 46s and
2nd count took 23s.
I don't get much less memory size as I expected, but longer run time for both counts. Anything
I did wrong? Why the memory foot print of "MEMORY_ONLY_SER" for KryoSerializer still use the
same size as default Java serializer, with worse duration?
Thanks
Yong 		 	   		  
Mime
View raw message