spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Imran Rashid <iras...@cloudera.com>
Subject Re: Why I didn't see the benefits of using KryoSerializer
Date Fri, 20 Mar 2015 16:54:38 GMT
Hi Yong,

yes I think your analysis is correct.  I'd imagine almost all serializers
out there will just convert a string to its utf-8 representation.  You
might be interested in adding compression on top of a serializer, which
would probably bring the string size down in almost all cases, but then you
also need to take the time for compression.  Kryo is generally more
efficient than the java serializer on complicated object types.

I guess I'm still a little surprised that kryo is slower than java
serialization for you.  You might try setting
"spark.kryo.referenceTracking" to false if you are just serializing objects
with no circular references.  I think that will improve the performance a
little, though I dunno how much.

It might be worth running your experiments again with slightly more
complicated objects and see what you observe.

Imran


On Thu, Mar 19, 2015 at 12:57 PM, java8964 <java8964@hotmail.com> wrote:

> I read the Spark code a little bit, trying to understand my own question.
>
> It looks like the different is really between
> org.apache.spark.serializer.JavaSerializer and
> org.apache.spark.serializer.KryoSerializer, both having the method named
> writeObject.
>
> In my test case, for each line of my text file, it is about 140 bytes of
> String. When either JavaSerializer.writeObject(140 bytes of String) or
> KryoSerializer.writeObject(140 bytes of String), I didn't see difference in
> the underline OutputStream space usage.
>
> Does this mean that KryoSerializer really doesn't give us any benefit for
> String type? I understand that for primitives types, it shouldn't have any
> benefits, but how about String type?
>
> When we talk about lower the memory using KryoSerializer in spark, under
> what case it can bring significant benefits? It is my first experience with
> the KryoSerializer, so maybe I am total wrong about its usage.
>
> Thanks
>
> Yong
>
> ------------------------------
> From: java8964@hotmail.com
> To: user@spark.apache.org
> Subject: Why I didn't see the benefits of using KryoSerializer
> Date: Tue, 17 Mar 2015 12:01:35 -0400
>
>
> Hi, I am new to Spark. I tried to understand the memory benefits of using
> KryoSerializer.
>
> I have this one box standalone test environment, which is 24 cores with
> 24G memory. I installed Hadoop 2.2 plus Spark 1.2.0.
>
> I put one text file in the hdfs about 1.2G.  Here is the settings in the
> spark-env.sh
>
> export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4"
> export SPARK_WORKER_MEMORY=32g
> export SPARK_DRIVER_MEMORY=2g
> export SPARK_EXECUTOR_MEMORY=4g
>
> First test case:
> val log=sc.textFile("hdfs://namenode:9000/test_1g/")
> log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)
> log.count()
> log.count()
>
> The data is about 3M rows. For the first test case, from the storage in
> the web UI, I can see "Size in Memory" is 1787M, and "Fraction Cached" is
> 70% with 7 cached partitions.
> This matched with what I thought, and first count finished about 17s, and
> 2nd count finished about 6s.
>
> 2nd test case after restart the spark-shell:
> val log=sc.textFile("hdfs://namenode:9000/test_1g/")
> log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)
> log.count()
> log.count()
>
> Now from the web UI, I can see "Size in Memory" is 1231M, and "Fraction
> Cached" is 100% with 10 cached partitions. It looks like caching the
> default "java serialized format" reduce the memory usage, but coming with a
> cost that first count finished around 39s and 2nd count finished around 9s.
> So the job runs slower, with less memory usage.
>
> So far I can understand all what happened and the tradeoff.
>
> Now the problem comes with when I tried to test with KryoSerializer
>
> SPARK_JAVA_OPTS="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer"
> /opt/spark/bin/spark-shell
> val log=sc.textFile("hdfs://namenode:9000/test_1g/")
> log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)
> log.count()
> log.count()
>
> First, I saw that the new serializer setting passed in, as proven in the
> Spark Properties of "Environment" shows "
>
> spark.driver.extraJavaOptions
>
>       -Dspark.serializer=org.apache.spark.serializer.KryoSerializer
>   ". This is not there for first 2 test cases.
> But in the web UI of "Storage", the "Size in Memory" is 1234M, with 100%
> "Fraction Cached" and 10 cached partitions. The first count took 46s and
> 2nd count took 23s.
>
> I don't get much less memory size as I expected, but longer run time for
> both counts. Anything I did wrong? Why the memory foot print of "MEMORY_ONLY_SER"
> for KryoSerializer still use the same size as default Java serializer, with
> worse duration?
>
> Thanks
>
> Yong
>

Mime
View raw message