spark-user mailing list archives

From Mohit Jaggi <>
Subject Re: MEMORY_ONLY_SER question
Date Wed, 05 Nov 2014 06:01:24 GMT
I used the word "streaming" but I did not mean to refer to spark streaming.
I meant: if a partition containing 10 objects was Kryo-serialized into a
single buffer, then in a mapPartitions() call, as I advance the iterator 10
times to access these objects one at a time, does the deserialization happen
a) once to get all 10 objects,
b) 10 times "incrementally" to get an object at a time, or
c) 10 times to get 10 objects and discard the "wrong" 9 objects [I doubt
this is a design anyone would have adopted]
I think your answer is option (a), and you referred to Spark Streaming to
indicate that there is no difference in its behavior from Spark.

If it is indeed option (a), I am happy with it and don't need to customize.
If it is (b), I would like to have (a) instead.
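
To make the difference between (a) and (b) concrete, here is a minimal sketch in plain Python. It uses pickle as a stand-in for Kryo and does not reflect Spark's actual implementation; the function names (serialize_partition, stream_partition, materialize_partition) are hypothetical and chosen only for illustration.

```python
import io
import pickle


def serialize_partition(objects):
    """Write every object back to back into one contiguous byte buffer."""
    buf = io.BytesIO()
    for obj in objects:
        pickle.dump(obj, buf)
    return buf.getvalue()


def stream_partition(data):
    """Option (b): each next() on this generator decodes exactly one record."""
    buf = io.BytesIO(data)
    while buf.tell() < len(data):
        yield pickle.load(buf)


def materialize_partition(data):
    """Option (a): decode the whole buffer up front into a list."""
    return list(stream_partition(data))


records = [{"id": i} for i in range(10)]
blob = serialize_partition(records)

it = stream_partition(blob)
first = next(it)   # only the first record has been decoded at this point
rest = list(it)    # the remaining nine are decoded as the iterator advances
```

With the streaming form, a caller that wants behavior (a) can still get it by collecting the iterator into a list, at the cost of holding all deserialized objects in memory at once.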

I am also wondering whether Kryo is good at compressing strings and numbers.
Often my data type is "Double", but the values could be encoded in far
fewer bits.
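
As a rough illustration of the size gap I have in mind, here is a sketch in plain Python using the struct module. It is not Kryo and says nothing about how Kryo encodes a Double; the sample values and the one-byte encoding are assumptions made up for the example.

```python
import struct

# Hypothetical sample: values typed as doubles that are really small integers.
values = [1.0, 2.0, 3.0, 250.0]

# A fixed-width IEEE 754 double costs 8 bytes per value.
as_doubles = struct.pack(f">{len(values)}d", *values)

# If every value is a whole number in 0..255, one unsigned byte each suffices.
fits_in_byte = all(v.is_integer() and 0 <= v <= 255 for v in values)
if fits_in_byte:
    as_bytes = struct.pack(f">{len(values)}B", *(int(v) for v in values))
else:
    as_bytes = as_doubles  # fall back to the full-width encoding

# Round trip of the narrow encoding back to floats.
decoded = [float(b) for b in struct.unpack(f">{len(values)}B", as_bytes)]

print(len(as_doubles), len(as_bytes))
```

A narrower encoding like this only works when the value range is known ahead of time, so it would have to live in a custom serializer or in the data model itself.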

On Tue, Nov 4, 2014 at 1:02 PM, Tathagata Das <> wrote:

> It is deserialized in a streaming manner as the iterator moves over the
> partition. This is a feature of core Spark, and Spark Streaming just
> uses it as is.
> What do you want to customize it to?
> On Tue, Nov 4, 2014 at 9:22 AM, Mohit Jaggi <> wrote:
>> Folks,
>> If I have an RDD persisted in MEMORY_ONLY_SER mode and then it is needed
>> for a transformation/action later, is the whole partition of the RDD
>> deserialized into Java objects first before my transform/action code works
>> on it? Or is it deserialized in a streaming manner as the iterator moves
>> over the partition? Is this behavior customizable? I generally use the Kryo
>> serializer.
>> Mohit.
