spark-dev mailing list archives

From Mingyu Kim <m...@palantir.com>
Subject Re: Task result is serialized twice by serializer and closure serializer
Date Thu, 05 Mar 2015 06:08:06 GMT
Yep, that makes sense. Thanks for the clarification!

Mingyu

On 3/4/15, 8:05 PM, "Patrick Wendell" <pwendell@gmail.com> wrote:

>Yeah, it will result in a second serialized copy of the array (costing
>some memory). But the computational overhead should be very small. The
>absolute worst case here will be when doing a collect() or something
>similar that just bundles the entire partition.
>
>- Patrick
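
[For a rough sense of what that second pass costs, here is a minimal Java sketch, not Spark code; the class and sizes are invented for illustration. It serializes a value once, then Java-serializes the resulting byte array again, the way the closure serializer wraps an already-serialized task result. The second pass adds only a small constant framing overhead in bytes, but it does materialize a second copy of the array in memory:]

```java
import java.io.*;

public class DoubleSerialization {
    // Serialize any object with plain Java serialization (what the
    // closure serializer uses), returning the raw bytes.
    static byte[] javaSerialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a task result: serialize once with the "data" serializer...
        int[] partition = new int[1 << 16];
        byte[] first = javaSerialize(partition);
        // ...then again, which just wraps the existing byte array: a second
        // full-size copy in memory, but only a few dozen extra bytes on the wire.
        byte[] second = javaSerialize(first);
        System.out.println("first pass:  " + first.length + " bytes");
        System.out.println("extra bytes: " + (second.length - first.length));
    }
}
```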
>
>On Wed, Mar 4, 2015 at 5:47 PM, Mingyu Kim <mkim@palantir.com> wrote:
>> The concern is really just the runtime overhead and memory footprint of
>> Java-serializing an already-serialized byte array again. We originally
>> noticed this when we were using RDD.toLocalIterator(), which serializes
>> the entire 64MB partition. We worked around this issue by
>> kryo-serializing and snappy-compressing the partition on the executor
>> side before returning it back to the driver, but this operation just
>> felt redundant.
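
[The workaround described above can be sketched roughly as follows. The original used Kryo plus Snappy; to keep this self-contained, the sketch substitutes plain Java serialization and the JDK's GZIP streams, and the `pack`/`unpack` names are hypothetical. The shape is the same: compress on the executor side, decompress on the driver side:]

```java
import java.io.*;
import java.util.zip.*;

public class CompressedResult {
    // Executor side: serialize and compress the partition before shipping it.
    static byte[] pack(Serializable partition) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos =
                 new ObjectOutputStream(new GZIPOutputStream(bos))) {
            oos.writeObject(partition);
        }
        return bos.toByteArray();
    }

    // Driver side: decompress and deserialize the shipped bytes.
    static Object unpack(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(
                 new GZIPInputStream(new ByteArrayInputStream(bytes)))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        long[] partition = new long[1 << 15]; // zero-filled, so it compresses well
        byte[] packed = pack(partition);
        long[] roundTripped = (long[]) unpack(packed);
        System.out.println("raw: " + partition.length * 8 + ", packed: " + packed.length);
        System.out.println("round trip ok: " + (roundTripped.length == partition.length));
    }
}
```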
>>
>> Your explanation about reporting the time taken makes it clearer why
>> it's designed this way. Since the byte array for the serialized task
>> result shouldn't account for the majority of the memory footprint
>> anyway, I'm okay with leaving it as is, then.
>>
>> Thanks,
>> Mingyu
>>
>>
>> On 3/4/15, 5:07 PM, "Patrick Wendell" <pwendell@gmail.com> wrote:
>>
>>>Hey Mingyu,
>>>
>>>I think it's broken out separately so we can record the time taken to
>>>serialize the result. Once we've serialized it once, the second
>>>serialization should be really simple since it's just wrapping
>>>something that has already been turned into a byte buffer. Do you see
>>>a specific issue with serializing it twice?
>>>
>>>I think you need to have two steps if you want to record the time
>>>taken to serialize the result, since that needs to be sent back to the
>>>driver when the task completes.
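
[The two-step structure described above can be sketched like this, with hypothetical names and plain Java serialization standing in for both of Spark's serializers. The value is serialized and timed first; the recorded duration then travels inside the wrapper that is serialized second, which is why a single-pass design could not report the metric:]

```java
import java.io.*;

public class TwoStepResult {
    // Hypothetical wrapper: carries the pre-serialized value plus the metric.
    static class Wrapper implements Serializable {
        final byte[] valueBytes;
        final long serializeNanos;
        Wrapper(byte[] valueBytes, long serializeNanos) {
            this.valueBytes = valueBytes;
            this.serializeNanos = serializeNanos;
        }
    }

    static byte[] serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    static byte[] packageResult(Serializable taskResult) throws IOException {
        // Step 1: serialize the value and record how long it took.
        long start = System.nanoTime();
        byte[] valueBytes = serialize(taskResult);
        long elapsed = System.nanoTime() - start;
        // Step 2: serialize the wrapper. The timing metric can only be
        // included because the value was already serialized separately.
        return serialize(new Wrapper(valueBytes, elapsed));
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = packageResult(new double[1024]);
        System.out.println("bytes sent to driver: " + wire.length);
    }
}
```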
>>>
>>>- Patrick
>>>
>>>On Wed, Mar 4, 2015 at 4:01 PM, Mingyu Kim <mkim@palantir.com> wrote:
>>>> Hi all,
>>>>
>>>> It looks like the result of a task is serialized twice: once by the
>>>> serializer (i.e. Java/Kryo depending on configuration) and once again
>>>> by the closure serializer (i.e. Java). To link the actual code,
>>>>
>>>> The first one:
>>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L213
>>>> The second one:
>>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L226
>>>>
>>>> This serializes the "value", i.e. the result of the task run, twice,
>>>> which affects things like collect(), takeSample(), and
>>>> toLocalIterator(). Would it make sense to simply serialize the
>>>> DirectTaskResult once using the regular "serializer" (as opposed to
>>>> the closure serializer)? Would it cause problems when the Accumulator
>>>> values are not Kryo-serializable?
>>>>
>>>> Alternatively, if we can assume that Accumulator values are small,
>>>> we can closure-serialize those, put the serialized byte array in
>>>> DirectTaskResult along with the raw task result "value", and
>>>> serialize the DirectTaskResult.
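
[A sketch of what that alternative wrapper might look like; the class and field names are hypothetical, this is not the actual DirectTaskResult. The accumulator updates are serialized eagerly into a small byte array, while the raw value rides along unserialized, so the whole object goes through the data serializer exactly once:]

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

public class ProposedResult {
    // Hypothetical variant of DirectTaskResult: the raw value is serialized
    // only when this wrapper itself is serialized (once, by the data
    // serializer), while the small accumulator updates were serialized up front.
    static class TaskResult<T> implements Serializable {
        final T value;           // raw task result, not pre-serialized
        final byte[] accumBytes; // closure-serialized accumulator updates
        TaskResult(T value, byte[] accumBytes) {
            this.value = value;
            this.accumBytes = accumBytes;
        }
    }

    static byte[] serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Small accumulator map, serialized eagerly with the closure serializer.
        Map<Long, Long> accums = new HashMap<>();
        accums.put(1L, 42L);
        byte[] accumBytes = serialize((Serializable) accums);
        // One serialization pass over the whole wrapper, value included.
        TaskResult<int[]> result = new TaskResult<>(new int[256], accumBytes);
        byte[] wire = serialize(result);
        System.out.println("single-pass wire size: " + wire.length);
    }
}
```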
>>>>
>>>> What do people think?
>>>>
>>>> Thanks,
>>>> Mingyu
>>
