spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kazuaki Ishizaki" <ISHIZ...@jp.ibm.com>
Subject Re: tuning - Spark data serialization for cache() ?
Date Tue, 08 Aug 2017 06:20:31 GMT
For Dataframe (and Dataset) cache(), neither Java nor Kryo serialization 
is used. There is no way to use Java or Kryo serialization for 
DataFrame.cache() or Dataset.cache() for in-memory.
Are you talking about serialization to Disk?  In previous mail, I talked 
about only in-memory.

Regards, 
Kazuaki Ishizaki



From:   Ofir Manor <ofir.manor@equalum.io>
To:     Kazuaki Ishizaki <ISHIZAKI@jp.ibm.com>
Cc:     user <user@spark.apache.org>
Date:   2017/08/08 03:12
Subject:        Re: tuning - Spark data serialization for cache() ?



Thanks a lot for the quick pointer!
So, is the advice I linked to in official Spark 2.2 documentation 
misleading? You are saying that Spark 2.2 does not use by Java 
serialization? And the tip to switch to Kyro is also outdated?

Ofir Manor
Co-Founder & CTO | Equalum
Mobile: +972-54-7801286 | Email: ofir.manor@equalum.io

On Mon, Aug 7, 2017 at 8:47 PM, Kazuaki Ishizaki <ISHIZAKI@jp.ibm.com> 
wrote:
For Dataframe (and Dataset), cache() already uses fast 
serialization/deserialization with data compression schemes.

We already identified some performance issues regarding cache(). We are 
working for alleviating these issues in 
https://issues.apache.org/jira/browse/SPARK-14098.
We expect that these PRs will be integrated into Spark 2.3.

Kazuaki Ishizaki



From:        Ofir Manor <ofir.manor@equalum.io>
To:        user <user@spark.apache.org>
Date:        2017/08/08 02:04
Subject:        tuning - Spark data serialization for cache() ?




Hi,
I'm using Spark 2.2, and have a big batch job, using dataframes (with 
built-in, basic types). It references the same intermediate dataframe 
multiple times, so I wanted to try to cache() that and see if it helps, 
both in memory footprint and performance.

Now, the Spark 2.2 tuning page (
http://spark.apache.org/docs/latest/tuning.html) clearly says:
1. The default Spark serialization is Java serialization.
2. It is recommended to switch to Kyro serialization.
3. "Since Spark 2.0.0, we internally use Kryo serializer when shuffling 
RDDs with simple types, arrays of simple types, or string type".

Now, I remember that in 2.0 launch, there were discussion of a third 
serialization format that is much more performant and compact. (Encoder?), 
but it is not referenced in the tuning guide and its Scala doc is not very 
clear to me. Specifically, Databricks shared some graphs etc of how much 
it is better than Kyro and Java serialization - see Encoders here:
https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html


So, is that relevant to cache()? If so, how can I enable it - and is it 
for MEMORY_AND_DISK_ONLY or MEMORY_AND_DISK_SER?

I tried to play with some other variations, like enabling Kyro by the 
tuning guide instructions, but didn't see any impact on the cached 
dataframe size (same tens of GBs in the UI). So any tips around that?

Thanks.
Ofir Manor
Co-Founder & CTO | Equalum
Mobile: +972-54-7801286 | Email: ofir.manor@equalum.io





Mime
View raw message