spark-dev mailing list archives

From Patrick Wendell <pwend...@gmail.com>
Subject Re: Task result is serialized twice by serializer and closure serializer
Date Thu, 05 Mar 2015 04:05:28 GMT
Yeah, it will result in a second serialized copy of the array (costing
some memory). But the computational overhead should be very small. The
absolute worst case here will be when doing a collect() or something
similar that just bundles the entire partition.
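The cost profile Patrick describes can be illustrated with a small stand-in (hypothetical class and names, not Spark code): Java-serializing a result once, then Java-serializing the resulting bytes again, as the closure serializer would. The second pass copies the payload wholesale plus a few bytes of stream framing, so CPU cost is small but a second copy of the array lives in memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

// Hypothetical stand-in for the discussion above, not Spark code.
public class DoubleSerDemo {
    static int firstLen;
    static int secondLen;
    static int overhead;

    static byte[] javaSerialize(Object obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // First pass: serialize the "task result" (a sample int array).
        int[] value = new int[1000];
        byte[] firstPass = javaSerialize(value);

        // Second pass: serialize the already-serialized bytes again, as
        // the closure serializer would. The payload is copied plus a few
        // bytes of framing, so a second copy now exists in memory.
        byte[] secondPass = javaSerialize(firstPass);

        firstLen = firstPass.length;
        secondLen = secondPass.length;
        overhead = secondLen - firstLen;
        System.out.println("first=" + firstLen + " second=" + secondLen
                + " framing overhead=" + overhead);
    }
}
```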

- Patrick

On Wed, Mar 4, 2015 at 5:47 PM, Mingyu Kim <mkim@palantir.com> wrote:
> The concern is really just the runtime overhead and memory footprint of
> Java-serializing an already-serialized byte array again. We originally
> noticed this when we were using RDD.toLocalIterator() which serializes the
> entire 64MB partition. We worked around this issue by kryo-serializing and
> snappy-compressing the partition on the executor side before returning it
> back to the driver, but this operation just felt redundant.
>
> Your explanation about reporting the time taken makes it clearer why it's
> designed this way. Since the byte array for the serialized task result
> shouldn't account for the majority of memory footprint anyways, I'm okay
> with leaving it as is, then.
>
> Thanks,
> Mingyu
>
>
>
>
>
> On 3/4/15, 5:07 PM, "Patrick Wendell" <pwendell@gmail.com> wrote:
>
>>Hey Mingyu,
>>
>>I think it's broken out separately so we can record the time taken to
>>serialize the result. Once we've serialized it once, the second
>>serialization should be very cheap since it's just wrapping
>>something that has already been turned into a byte buffer. Do you see
>>a specific issue with serializing it twice?
>>
>>I think you need to have two steps if you want to record the time
>>taken to serialize the result, since that needs to be sent back to the
>>driver when the task completes.
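The two-step flow described above can be sketched roughly as follows (hypothetical names, plain Java serialization standing in for both the task serializer and the closure serializer; the real code is in Spark's Executor):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical envelope: serialized value plus the recorded serialization
// time, roughly what gets shipped back to the driver with task metrics.
public class TimedResultDemo implements Serializable {
    static long serTimeNs;
    static int valueLen;
    static int envelopeLen;

    final byte[] valueBytes;
    final long timeNs;

    TimedResultDemo(byte[] valueBytes, long timeNs) {
        this.valueBytes = valueBytes;
        this.timeNs = timeNs;
    }

    static byte[] serialize(Object obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        int[] result = new int[100];

        // Step 1: serialize the result and time it, so the duration can
        // be reported to the driver when the task completes.
        long start = System.nanoTime();
        byte[] valueBytes = serialize(result);
        serTimeNs = System.nanoTime() - start;

        // Step 2: wrap the bytes and the timing in an envelope and
        // serialize that once more for the trip to the driver.
        byte[] envelope = serialize(new TimedResultDemo(valueBytes, serTimeNs));
        valueLen = valueBytes.length;
        envelopeLen = envelope.length;
    }
}
```

The timing could not be recorded in a single pass, since the duration is only known after the value has been serialized.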
>>
>>- Patrick
>>
>>On Wed, Mar 4, 2015 at 4:01 PM, Mingyu Kim <mkim@palantir.com> wrote:
>>> Hi all,
>>>
>>> It looks like the result of a task is serialized twice: once by the
>>>serializer (i.e. Java/Kryo depending on configuration) and once again by
>>>the closure serializer (i.e. Java). To link the actual code:
>>>
>>> The first one:
>>>https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L213
>>> The second one:
>>>https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L226
>>>
>>> This serializes the "value", i.e. the result of the task run, twice,
>>>which affects things like collect(), takeSample(), and
>>>toLocalIterator(). Would it make sense to simply serialize the
>>>DirectTaskResult once using the regular "serializer" (as opposed to
>>>closure serializer)? Would it cause problems when the Accumulator values
>>>are not Kryo-serializable?
>>>
>>> Alternatively, if we can assume that Accumulator values are small, we can
>>>closure-serialize those, put the serialized byte array in
>>>DirectTaskResult with the raw task result "value", and serialize
>>>DirectTaskResult.
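Mingyu's alternative could look roughly like this (hypothetical class and field names; the real DirectTaskResult lives in org.apache.spark.scheduler): the accumulator updates, assumed small, are pre-serialized into a byte array, while the raw task value is serialized exactly once, when the whole result object is serialized.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed single-pass result shape.
public class SinglePassResultDemo implements Serializable {
    static int wireLen;
    static int accumLen;

    final Object value;       // raw task result, serialized only once below
    final byte[] accumBytes;  // pre-serialized accumulator updates

    SinglePassResultDemo(Object value, byte[] accumBytes) {
        this.value = value;
        this.accumBytes = accumBytes;
    }

    static byte[] serialize(Object obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Closure-serialize the small accumulator updates up front.
        Map<String, Long> accumUpdates = new HashMap<>();
        accumUpdates.put("recordsRead", 1000L);
        byte[] accumBytes = serialize(accumUpdates);

        // Single pass: the large value is serialized exactly once, as a
        // field of the result object, by the regular serializer.
        int[] taskValue = new int[10];
        byte[] wire = serialize(new SinglePassResultDemo(taskValue, accumBytes));

        accumLen = accumBytes.length;
        wireLen = wire.length;
    }
}
```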
>>>
>>> What do people think?
>>>
>>> Thanks,
>>> Mingyu
>


