spark-issues mailing list archives

From "Sean Owen (JIRA)" <>
Subject [jira] [Commented] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.
Date Sun, 05 Oct 2014 16:15:34 GMT


Sean Owen commented on SPARK-3561:

I'd be interested to see a more specific motivating use case. Is this about using Tez, for
example? Where does it help to stack Spark on Tez on YARN, or on MR2, etc.? Spark Core and
Tez overlap, to be sure, and I'm not sure how much value it adds to run one on the other.
Kind of like running Oracle on MySQL or something. Whatever the feature is: might it not be
more natural to integrate it into Spark itself?

It would be great if this were all just a matter of one extra trait and interface. In practice
I suspect there are a number of hidden assumptions throughout the code that may leak through
attempts at this abstraction. 

I am definitely asking rather than asserting, curious to see more specifics about the upside.

> Expose pluggable architecture to facilitate native integration with third-party execution
> -------------------------------------------------------------------------------------------------------
>                 Key: SPARK-3561
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Oleg Zhurakousky
>              Labels: features
>             Fix For: 1.2.0
>         Attachments: SPARK-3561.pdf
> Currently Spark _integrates with external resource-managing platforms_ such as Apache
> Hadoop YARN and Mesos to facilitate execution of a Spark DAG in a distributed environment
> provided by those platforms.
> However, this integration is tightly coupled within Spark's implementation, making it
> rather difficult to introduce integration points with other resource-managing platforms
> without constant modifications to Spark's core (see comments below for more details).
> In addition, Spark _does not provide any integration points to third-party **DAG-like**
> and **DAG-capable** execution environments_ native to those platforms, thus limiting
> access to some of their native features (e.g., MR2/Tez stateless shuffle, YARN resource
> localization, YARN management and monitoring, and more) as well as the specialization
> aspects of such execution environments (open source and proprietary). As an example, the
> inability to gain access to such features is starting to affect Spark's viability in
> large-scale batch and/or ETL applications.
> Introducing a pluggable architecture would solve both of the issues mentioned above,
> ultimately benefiting Spark's technology and community by allowing it to venture into
> co-existence and collaboration with a variety of existing Big Data platforms as well as
> the ones yet to come to the market.
> Proposal:
> The proposed approach would introduce a pluggable JobExecutionContext (trait) as a
> non-public API (@DeveloperAPI).
> The trait will define only 4 operations:
> * hadoopFile
> * newAPIHadoopFile
> * broadcast
> * runJob
> Each method maps directly to the corresponding method in the current version of
> SparkContext. The JobExecutionContext implementation will be accessed by SparkContext via
> the master URL, with the default implementation containing the existing code from
> SparkContext, thus allowing the current (corresponding) methods of SparkContext to
> delegate to such an implementation, ensuring binary and source compatibility with older
> versions of Spark.
> An integrator will now have the option to provide a custom implementation of
> JobExecutionContext by either implementing it from scratch or extending from
> DefaultExecutionContext.
> Please see the attached design doc and pull request for more details.
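
The delegation idea in the proposal can be illustrated with a small, in-process toy. This is only a sketch under stated assumptions, not the design from the attached document or the pull request: it is written in Python rather than Scala, the method signatures are heavily simplified, and `ToyContext`, `RecordingExecutionContext`, and the `plugins` lookup keyed by the master string are all hypothetical names invented for this illustration. Only `JobExecutionContext`, `DefaultExecutionContext`, and the four operation names come from the proposal text.

```python
from abc import ABC, abstractmethod


class JobExecutionContext(ABC):
    """Toy stand-in for the proposed trait; signatures are simplified assumptions."""

    @abstractmethod
    def hadoop_file(self, path): ...

    @abstractmethod
    def new_api_hadoop_file(self, path): ...

    @abstractmethod
    def broadcast(self, value): ...

    @abstractmethod
    def run_job(self, partitions, func): ...


class DefaultExecutionContext(JobExecutionContext):
    """Plays the role of the default implementation that holds the existing logic."""

    def hadoop_file(self, path):
        return [[f"old-api record from {path}"]]

    def new_api_hadoop_file(self, path):
        return [[f"new-api record from {path}"]]

    def broadcast(self, value):
        return value  # in-process toy: there is nowhere to ship the value

    def run_job(self, partitions, func):
        # Apply func to every partition and collect one result per partition.
        return [func(p) for p in partitions]


class RecordingExecutionContext(DefaultExecutionContext):
    """Hypothetical integrator plug-in: extends the default, overrides one operation."""

    def __init__(self):
        self.jobs_run = 0

    def run_job(self, partitions, func):
        self.jobs_run += 1
        return super().run_job(partitions, func)


class ToyContext:
    """Toy analogue of SparkContext: its public methods only delegate."""

    def __init__(self, master, plugins=None):
        # Assumption: the master string selects the implementation via a lookup
        # table; unknown masters fall back to the default implementation.
        impl = (plugins or {}).get(master, DefaultExecutionContext)
        self.execution_context = impl()

    def hadoop_file(self, path):
        return self.execution_context.hadoop_file(path)

    def new_api_hadoop_file(self, path):
        return self.execution_context.new_api_hadoop_file(path)

    def broadcast(self, value):
        return self.execution_context.broadcast(value)

    def run_job(self, partitions, func):
        return self.execution_context.run_job(partitions, func)


if __name__ == "__main__":
    plugged = ToyContext("recording", {"recording": RecordingExecutionContext})
    print(plugged.run_job([[1, 2], [3]], sum))
    print(plugged.execution_context.jobs_run)
```

Callers of `ToyContext` see the same four methods whichever implementation is plugged in, which is the compatibility property the proposal is after. The toy says nothing about the hidden assumptions elsewhere in the code base that the comment above is concerned with.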

This message was sent by Atlassian JIRA
