spark-issues mailing list archives

From "Oleg Zhurakousky (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.
Date Sun, 05 Oct 2014 16:50:35 GMT

     [ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oleg Zhurakousky updated SPARK-3561:
------------------------------------
    Description: 
Currently Spark provides integration with external resource managers such as Apache Hadoop YARN and Mesos. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large-scale, batch, and/or ETL applications when they run alongside other applications (Spark and others) and services in YARN. 

Proposal: 
The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and delegate to the Hadoop execution environment - as a non-public API (@DeveloperAPI) not exposed to end users of Spark. 
The trait will define only 4 operations (sketched below): 
* hadoopFile 
* newAPIHadoopFile 
* broadcast 
* runJob 
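
For illustration, a minimal sketch of what the trait could look like, assuming each operation mirrors the corresponding SparkContext signature of the Spark 1.x line and receives the active SparkContext as a parameter; the exact signatures are assumptions, not part of this proposal:

{code}
import scala.reflect.ClassTag

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.InputFormat
import org.apache.hadoop.mapreduce.{InputFormat => NewInputFormat}

import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

trait JobExecutionContext {

  // Mirrors SparkContext.hadoopFile (old-style mapred API).
  def hadoopFile[K, V](
      sc: SparkContext,
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int): RDD[(K, V)]

  // Mirrors SparkContext.newAPIHadoopFile (new-style mapreduce API).
  def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
      sc: SparkContext,
      path: String,
      fClass: Class[F],
      kClass: Class[K],
      vClass: Class[V],
      conf: Configuration): RDD[(K, V)]

  // Mirrors SparkContext.broadcast.
  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]

  // Mirrors the core SparkContext.runJob variant that the other
  // runJob overloads funnel into.
  def runJob[T, U: ClassTag](
      sc: SparkContext,
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit): Unit
}
{code}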

Each method maps directly to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be accessed by SparkContext via a master URL of the form "execution-context:foo.bar.MyJobExecutionContext", with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to it. An integrator will then have the option to provide a custom implementation of JobExecutionContext, either by implementing it from scratch or by extending from DefaultExecutionContext, as in the sketch below. 
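
As an illustration of the integrator path, a hypothetical sketch; only the execution-context: master URL scheme and the DefaultExecutionContext name come from this proposal, while the package, class name, and signatures are assumptions carried over from the trait sketch above:

{code}
package foo.bar

import scala.reflect.ClassTag

import org.apache.spark.{SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical custom context: reuse the default behavior (which would hold
// the code currently in SparkContext) and override only what the target
// environment needs, in this case job execution.
class MyJobExecutionContext extends DefaultExecutionContext {

  override def runJob[T, U: ClassTag](
      sc: SparkContext,
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit): Unit = {
    // A real integration would translate the RDD lineage into the native
    // engine's DAG here; delegating up keeps the default Spark semantics.
    super.runJob(sc, rdd, func, partitions, allowLocal, resultHandler)
  }
}

// Selecting the custom context through the proposed master URL scheme:
object Demo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("execution-context-demo")
      .setMaster("execution-context:foo.bar.MyJobExecutionContext")
    val sc = new SparkContext(conf)
    // Regular Spark code follows; hadoopFile, newAPIHadoopFile, broadcast,
    // and runJob now delegate to MyJobExecutionContext.
    sc.stop()
  }
}
{code}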

Please see the attached design doc for more details. 
A pull request will be posted shortly as well.

  was:
Currently Spark _integrates with external resource-managing platforms_ such as Apache Hadoop YARN and Mesos to facilitate execution of the Spark DAG in the distributed environments provided by those platforms. 

However, this integration is tightly coupled to Spark's implementation, making it rather difficult to introduce integration points with other resource-managing platforms without constant modifications to Spark's core (see comments below for more details). 

In addition, Spark _does not provide any integration points to third-party **DAG-like** and **DAG-capable** execution environments_ native to those platforms, thus limiting access to some of their native features (e.g., MR2/Tez stateless shuffle, YARN resource localization, YARN management and monitoring, and more) as well as to the specializations of such execution environments (open source and proprietary). As an example, the inability to gain access to such features is starting to affect Spark's viability in large-scale, batch, and/or ETL applications. 

Introducing a pluggable architecture would solve both of the issues mentioned above, ultimately benefiting Spark's technology and community by allowing it to venture into co-existence and collaboration with a variety of existing Big Data platforms as well as those yet to come to market.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) as a non-public API (@DeveloperAPI).
The trait will define 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method maps directly to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be accessed by SparkContext via a master URL of the form _execution-context:foo.bar.MyJobExecutionContext_, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to it while ensuring binary and source compatibility with older versions of Spark. An integrator will now have the option to provide a custom implementation of JobExecutionContext, either by implementing it from scratch or by extending from DefaultExecutionContext.
Please see the attached design doc and pull request for more details.


> Expose pluggable architecture to facilitate native integration with third-party execution environments.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-3561
>                 URL: https://issues.apache.org/jira/browse/SPARK-3561
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Oleg Zhurakousky
>              Labels: features
>             Fix For: 1.2.0
>
>         Attachments: SPARK-3561.pdf


