spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"
Date Mon, 26 Jan 2015 11:48:35 GMT

    [ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291737#comment-14291737 ]

Sean Owen commented on SPARK-2389:
----------------------------------

Why can't N front-ends talk to a process built around one long-running Spark app? I think
that's what the OP is talking about, and it can be done right now. One Spark app having many
contexts doesn't quite make sense, as an app is a SparkContext.
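
As a rough sketch of that pattern (Scala; the input path and thread count are placeholders, and the
"front-ends" below are just threads standing in for processes that would forward requests to this
one long-running app):

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration._
    import org.apache.spark.{SparkConf, SparkContext}

    object SharedSparkApp {
      def main(args: Array[String]): Unit = {
        // One long-running application: a single SparkContext owned by this process.
        val conf = new SparkConf()
          .setAppName("shared-spark-app")
          .set("spark.scheduler.mode", "FAIR") // concurrent jobs share the executors fairly
        val sc = new SparkContext(conf)

        // Cached once, then visible to every request served by this process.
        val events = sc.textFile("hdfs:///data/events").cache() // placeholder path

        // Several callers submitting jobs concurrently against the same context.
        implicit val ec: ExecutionContext =
          ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))
        val answers = (1 to 4).map { i =>
          Future(events.filter(_.contains(s"key$i")).count())
        }
        answers.foreach(f => println(Await.result(f, 10.minutes)))

        sc.stop()
      }
    }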

But [~meken], are you really talking about HA for the driver?

> globally shared SparkContext / shared Spark "application"
> ---------------------------------------------------------
>
>                 Key: SPARK-2389
>                 URL: https://issues.apache.org/jira/browse/SPARK-2389
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Robert Stupp
>
> The documentation (in the Cluster Mode Overview) states:
> bq. Each application gets its own executor processes, which *stay up for the duration
of the whole application* and run tasks in multiple threads. This has the benefit of isolating
applications from each other, on both the scheduling side (each driver schedules its own tasks)
and executor side (tasks from different applications run in different JVMs). However, it also
means that *data cannot be shared* across different Spark applications (instances of SparkContext)
without writing it to an external storage system.
> IMO this is a limitation that should be lifted so that any number of --driver-- client
processes can share executors and (persistent / cached) data.
> This is especially useful if you have a bunch of frontend servers (dumb web app servers)
that want to use Spark as a _big computing machine_. Most important is the fact that Spark
is quite good at caching/persisting data in memory / on disk, thus removing load from backend
data stores.
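
A small illustration of that caching, as a sketch against an existing SparkContext "sc" with a
placeholder path; note the cached copy lives only as long as this one application's context,
which is exactly the limitation being described:

    import org.apache.spark.storage.StorageLevel

    // Load once from the backing store, then serve repeated queries from Spark's cache.
    val users = sc.textFile("hdfs:///warehouse/users") // placeholder path
      .persist(StorageLevel.MEMORY_AND_DISK)
    users.count()                                            // first action materializes the cache
    val active = users.filter(_.contains("active")).count()  // re-reads memory/disk, not the backend
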
> In other words: it would be really great to let different --driver-- client JVMs operate on the
same RDDs and benefit from Spark's caching/persistence.
> It would however require some administration mechanisms to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of CPUs, etc.) on the fly
> * stop a shared context
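
A purely hypothetical sketch of what such an administration interface might look like; nothing
like this exists in Spark, and the names and signatures below are invented for illustration only:

    import org.apache.spark.SparkConf

    // Hypothetical: none of these types exist in Spark.
    case class SharedContextHandle(id: String)

    trait SharedContextAdmin {
      def start(conf: SparkConf): SharedContextHandle                 // start a shared context
      def reconfigure(ctx: SharedContextHandle,
                      numExecutors: Int,
                      coresPerExecutor: Int): Unit                    // update resources on the fly
      def stop(ctx: SharedContextHandle): Unit                        // stop a shared context
    }
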
> Even "conventional" batch MR applications would benefit if ran fequently against the
same data set.
> As an implicit requirement, RDD persistence could get a TTL for its materialized state.
> With such a feature, the overall performance of today's web applications could then be
increased by adding more web app servers, more Spark nodes, more NoSQL nodes, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

