spark-user mailing list archives

From Timothy Chen <...@mesosphere.io>
Subject Re: Shared Drivers
Date Tue, 03 Mar 2015 17:32:42 GMT
Hi John,

I think there are limitations in the way drivers are designed that require a separate
JVM process per driver, so it's not possible without code and design changes, AFAIK.

A driver shouldn't stay open past your job's lifetime, though, so even without sharing between
apps it shouldn't be wasting as much as you described.

Tim 


> On Feb 27, 2015, at 7:50 AM, John Omernik <john@omernik.com> wrote:
> 
> All - I've asked this question before, and probably due to my own poor comprehension or the clumsy way I asked it, I am still unclear on the answer. I'll try again, this time using crude visual aids.
> 
> I am using IPython Notebooks with JupyterHub (a multi-user notebook server). To make the environment really smooth for data exploration, I am creating a Spark context every time a notebook is opened. (See image below.)
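> 
> To illustrate, a minimal sketch of that per-notebook startup code (the app name and master URL are just illustrative, using PySpark's standard SparkConf/SparkContext API):
> 
>     from pyspark import SparkConf, SparkContext
> 
>     # Every notebook kernel runs this at startup, so each open notebook
>     # holds its own driver process for as long as the kernel lives.
>     conf = SparkConf().setAppName("notebook-adhoc").setMaster("mesos://master:5050")
>     sc = SparkContext(conf=conf)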
> 
> This can cause issues on my "analysis" (JupyterHub) server: if, say, the driver uses 1024 MB, each notebook opens up a driver regardless of how much Spark is actually used. Yes, I should probably set it up to only create the context on demand, but that would cause additional delay. Another issue is that once contexts are created, they are not closed until the notebook is halted, and users could leave notebook kernels running, wasting additional resources.
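> 
> For reference, a rough sketch of that on-demand alternative (create the context lazily, and stop it explicitly so the driver is released before the kernel halts; the helper names are just illustrative):
> 
>     from pyspark import SparkConf, SparkContext
> 
>     _sc = None
> 
>     def get_spark_context():
>         # Create the driver only when a cell actually needs Spark.
>         global _sc
>         if _sc is None:
>             _sc = SparkContext(conf=SparkConf().setAppName("notebook-adhoc"))
>         return _sc
> 
>     def stop_spark_context():
>         # Release the driver instead of holding it until the kernel is halted.
>         global _sc
>         if _sc is not None:
>             _sc.stop()
>             _sc = None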
> 
> 
> 
> <Current.png>
> 
> What I would like to do is share a context per user. Basically, each user on the system would get only one Spark context, and all ad hoc queries or work would be sent through one driver. This makes sense to me: users will often want ad hoc Spark capabilities, and this lets a context sit open, ready for ad hoc work, without being over the top in resource usage, especially if a kernel is left open.
> 
> <Shared.png>
> 
> On the Mesos list I was made aware of SPARK-5338, which Tim Chen is working on. Based on conversations with him, this wouldn't completely achieve what I am looking for, in that each notebook would likely still start a Spark context; but at least in this case the Spark driver would reside on the cluster, and thereby be resource-managed by the cluster. One thing to note here: if the design is similar to the YARN cluster design, then my IPython setup may not work at all with Tim's approach, in that the shells (if I remember correctly) don't work in cluster mode on YARN.
> 
> <SPARK-5338.png>
> 
> 
> Barring that issue (the PySpark shell not working in cluster mode), if drivers could be shared per user as I initially proposed, run on the cluster as Tim proposed, and the shells still worked in cluster mode, that would be ideal. We'd have everything running on the cluster, and we wouldn't have wasted or left-open drivers consuming resources.
> 
> <Shared-SPARK5338.png>
> 
> 
> 
> 
> So I guess, ideally, what keeps us from:
> 
> A. running the driver in the cluster (as in YARN cluster mode)
> B. sharing drivers?
> 
> My guess is that I'm missing something fundamental here about how Spark is supposed to work, but I see this as a more efficient use of resources for this type of work. I may also look into creating some Docker containers and see how those work, but ideally I'd like to understand this at a base level... i.e., why can't cluster (YARN and Mesos) contexts be connected to the way a Spark standalone cluster context can?
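> 
> To make that last question concrete, a rough sketch of the difference in where the driver ends up (the master URLs are illustrative):
> 
>     from pyspark import SparkConf, SparkContext
> 
>     # Standalone: the driver runs here, next to the notebook, and talks to the master.
>     sc = SparkContext(conf=SparkConf().setAppName("adhoc").setMaster("spark://master:7077"))
> 
>     # Mesos / YARN client mode keep the driver on this host as well:
>     #   .setMaster("mesos://master:5050")   or   .setMaster("yarn-client")
>     # In yarn-cluster mode the driver runs inside the cluster, which is why the
>     # interactive shells can't attach to it.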
> 
> Thanks!
> 
> 
> John
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

