spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)
Date Wed, 01 Mar 2017 08:59:45 GMT

    [ https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889799#comment-15889799 ]

Apache Spark commented on SPARK-18890:
--------------------------------------

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/17116

> Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18890
>                 URL: https://issues.apache.org/jira/browse/SPARK-18890
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>    Affects Versions: 2.1.0
>            Reporter: Kay Ousterhout
>            Priority: Minor
>
> As part of benchmarking this change: https://github.com/apache/spark/pull/15505 and
> alternatives, [~shivaram] and I found that moving task serialization from TaskSetManager (which
> happens as part of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend (CGSB) leads
> to approximately a 10% reduction in job runtime for a job that counted 10,000 partitions (that
> each had 1 int) using 20 machines.  Similar performance improvements were reported in the
> pull request linked above.  This appears to be because the TaskSchedulerImpl thread is
> the bottleneck, so moving serialization to CGSB reduces runtime.  This change may *not* improve
> runtime (and could potentially worsen runtime) in scenarios where the CGSB thread is the bottleneck
> (e.g., if tasks are very large, so calling launch to send the tasks to the executor blocks
> on the network).
> One benefit of implementing this change is that it makes it easier to parallelize the
> serialization of tasks (different tasks could be serialized by different threads).  Another
> benefit is that all of the serialization occurs in the same place (currently, the Task is
> serialized in TaskSetManager, and the TaskDescription is serialized in CGSB).
> I'm not totally convinced we should fix this because it seems like there are better ways
> of reducing the serialization time (e.g., by re-using a single serialized object with the
> Task/jars/files and broadcasting it for each stage) but I wanted to open this JIRA to document
> the discussion.
> cc [~witgo]
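
The restructuring described in the issue can be sketched roughly as follows. This is a minimal illustration, not Spark code: the names schedule, serialize and launch_all are hypothetical, tasks are plain dicts, and pickle stands in for Spark's own serializers. The "scheduler" step only decides which tasks to run; the "backend" step owns all serialization and hands each task to a pool thread. In CPython the global interpreter lock means this particular sketch would not be expected to run faster; it only shows where the work sits once serialization is moved out of the scheduling step.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor


def schedule(num_tasks):
    """Hypothetical scheduler step: pick the tasks to run, but do not serialize them."""
    return [{"task_id": i, "partition": i} for i in range(num_tasks)]


def serialize(task):
    """Hypothetical serialization step (pickle is a stand-in, not what Spark uses)."""
    return pickle.dumps(task)


def launch_all(tasks, pool):
    """Hypothetical backend step: serialize every task on a pool thread.

    pool.map returns results in input order, so payloads[i] belongs to tasks[i].
    """
    return list(pool.map(serialize, tasks))


# All serialization happens in one place (launch_all), spread over several threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    payloads = launch_all(schedule(8), pool)

# Round-trip the payloads to confirm each one decodes back to its task.
restored = [pickle.loads(p) for p in payloads]
```

Keeping schedule free of serialization mirrors the first benefit named in the issue (different tasks can be serialized by different threads), and funnelling every payload through launch_all mirrors the second (all serialization occurs in the same place).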



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

