spark-dev mailing list archives

From Christopher Nguyen <...@adatao.com>
Subject Re: Spark development for undergraduate project
Date Thu, 19 Dec 2013 20:52:36 GMT
+1 to most of Andrew's suggestions here, and while we're in that
neighborhood, how about generalizing something like "wtf-spark" from the
Bizo team (http://youtu.be/6Sn1xs5DN1Y?t=38m36s)? It may not be of high
academic interest, but it's something people would use many times a day
while debugging.

Or am I behind and something like that is already there in 0.8?

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen



On Thu, Dec 19, 2013 at 10:56 AM, Andrew Ash <andrew@andrewash.com> wrote:

> I think there are also some improvements that could be made to
> deployability in an enterprise setting.  From my experience:
>
> 1. Most places I deploy Spark in don't have internet access.  So I can't
> build from source, compile against a different version of Hadoop, etc.
> without doing it locally and then getting that onto my servers manually.
>  This is less a problem with Spark now that there are binary distributions,
> but it's still a problem for using Mesos with Spark.
> 2. Configuration of Spark is confusing -- you can set configuration through
> Java system properties, environment variables, and command-line parameters,
> and for the standalone cluster deployment mode you need to worry about
> whether these need to be set on the master, the worker, the executor, or the
> application/driver program.  Also, because spark-shell automatically
> instantiates a SparkContext, you have to set any system properties in the
> init scripts or on the command line with
> JAVA_OPTS="-Dspark.executor.memory=8g" etc.  I'm not sure what needs to be
> done, but it feels like there are gains to be made in the configuration
> options here (a rough sketch of the current spread follows this list).
> Ideally, I would have one configuration file that can be used in all 4
> places and that's the only place to make configuration changes.
> 3. Standalone cluster mode could use improved resiliency for starting,
> stopping, and keeping alive a service -- there are custom init scripts that
> call each other in a mess of ways: spark-shell, spark-daemon.sh,
> spark-daemons.sh, spark-config.sh, spark-env.sh, compute-classpath.sh,
> spark-executor, spark-class, run-example, and several others in the bin/
> directory.  I would love it if Spark used the Tanuki Service Wrapper, which
> is widely used for Java service daemons and supports retries, installation
> as init scripts that can be chkconfig'd, etc. (a rough sketch of a wrapper
> config follows this list).  Let's not re-solve the "how do I keep a service
> running?" problem when it's been done so well by Tanuki -- we use it at my
> day job for all our services, plus it's used by Elasticsearch.  This would
> help solve the problem where a quick bounce of the master causes all the
> workers to self-destruct.
> 4. Sensitivity to hostname vs FQDN vs IP address in the Spark URL -- this is
> entirely an Akka bug based on previous mailing list discussion with Matei,
> but it'd be awesome if you could use the hostname, the FQDN, or the IP
> address interchangeably in the Spark URL and not have Akka barf at you (the
> three forms are shown after this list).
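>
> To make the spread in item 2 concrete, here is a rough sketch, from memory,
> of how one setting (spark.executor.memory) gets expressed in each place
> today; the host names and values are made up:
>
>   // (1) In the driver/application code, before the SparkContext is created:
>   System.setProperty("spark.executor.memory", "8g")
>   val sc = new SparkContext("spark://master:7077", "MyApp")
>
>   // (2) For spark-shell, which creates its own SparkContext, as a JVM option:
>   //   JAVA_OPTS="-Dspark.executor.memory=8g" ./spark-shell
>
>   // (3) On the cluster side, as an environment variable in conf/spark-env.sh
>   //     (if I recall the variable name correctly):
>   //   export SPARK_MEM=8g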
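>
> And for item 3, roughly what a Tanuki wrapper.conf for the standalone master
> might look like -- this is only a sketch, and the class path and parameters
> would need checking against a real install:
>
>   wrapper.java.command=java
>   wrapper.java.mainclass=org.tanukisoftware.wrapper.WrapperSimpleApp
>   wrapper.java.classpath.1=/opt/spark/assembly/*.jar
>   wrapper.app.parameter.1=org.apache.spark.deploy.master.Master
>   # further wrapper.app.parameter.N entries would carry the master's arguments
>   # restart the JVM automatically if the master process dies
>   wrapper.on_exit.default=RESTART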
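>
> And for item 4, the three equivalent ways I'd like to be able to write the
> same master URL (made-up addresses), where today only the exact form the
> master advertised itself under is accepted:
>
>   spark://spark-master:7077               (short hostname)
>   spark://spark-master.example.com:7077   (FQDN)
>   spark://10.1.2.3:7077                   (IP address)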
>
> I've been telling myself I'd look into these at some point but just haven't
> gotten around to them yet.  Some day!  I would prioritize these requests
> from most- to least-important as 3, 2, 4, 1.
>
> Andrew
>
>
> On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath <nick.pentreath@gmail.com>
> wrote:
>
> > Or, if you're extremely ambitious, work on implementing Spark Streaming
> > in Python.
> > Sent from Mailbox for iPhone
> >
> > On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia <matei.zaharia@gmail.com>
> > wrote:
> >
> > > Hi Matt,
> > >
> > > If you want to get started looking at Spark, I recommend the following
> > > resources:
> > >
> > > - Our issue tracker at http://spark-project.atlassian.net contains some
> > > issues marked “Starter” that are good places to jump into. You might be
> > > able to take one of those and extend it into a bigger project.
> > > - The “contributing to Spark” wiki page covers how to send patches and
> > > set up development:
> > > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> > > - This talk has an intro to Spark internals (video and slides are in the
> > > comments): http://www.meetup.com/spark-users/events/94101942/
> > >
> > > For a longer project, here are some possible ones:
> > >
> > > - Create a tool that automatically checks which Scala API methods are
> > > missing in Python. We had a similar one for Java that was very useful.
> > > Even better would be to automatically create wrappers for the Scala ones
> > > (a rough sketch of the reflection side follows this list).
> > > - Extend the Spark monitoring UI with profiling information (to sample
> > > the workers and say where they’re spending time, or what data structures
> > > consume the most memory); a tiny sampling sketch also follows this list.
> > > - Pick and implement a new machine learning algorithm for MLlib.
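> > >
> > > For the first idea, a very rough sketch of the Scala side of such a
> > > checker (the PySpark-side list and the actual diff are left out, and the
> > > package name assumes 0.8's org.apache.spark namespace):
> > >
> > >   import org.apache.spark.rdd.RDD
> > >
> > >   // Collect the public method names on the Scala RDD class via
> > >   // reflection; these would then be compared against the methods that
> > >   // PySpark's RDD class exposes.
> > >   val scalaRddMethods: Set[String] =
> > >     classOf[RDD[_]].getMethods.map(_.getName).toSet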
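> > >
> > > And for the profiling idea, one possible starting point -- a sketch that
> > > takes a single sample of every live thread's top stack frame; a
> > > worker-side profiler would do this periodically and aggregate the counts:
> > >
> > >   import scala.collection.JavaConverters._
> > >
> > >   // One sample: the top frame of each live thread in this JVM.
> > >   val topFrames: Iterable[String] =
> > >     Thread.getAllStackTraces.asScala.collect {
> > >       case (_, frames) if frames.nonEmpty => frames.head.toString
> > >     }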
> > > Matei
> > > On Dec 17, 2013, at 10:43 AM, Matthew Cheah <mccheah@uwaterloo.ca>
> > > wrote:
> > >> Hi everyone,
> > >>
> > >> During my most recent internship, I worked extensively with Apache
> > >> Spark, integrating it into a company's data analytics platform. I've
> > >> now become interested in contributing to Apache Spark.
> > >>
> > >> I'm returning to undergraduate studies in January, and there is an
> > >> academic course which is simply a standalone software engineering
> > >> project. I was thinking that some contribution to Apache Spark would
> > >> satisfy my curiosity, help continue supporting the company I interned
> > >> at, and give me the academic credits required to graduate, all at the
> > >> same time. It seems like too good an opportunity to pass up.
> > >>
> > >> With that in mind, I have the following questions:
> > >>
> > >>   1. At this point, is there any self-contained project that I could
> > >>   work on within Spark? Ideally, I would work on it independently, in
> > >>   about a three-month time frame. This time also needs to accommodate
> > >>   ramping up on the Spark codebase and adjusting to the Scala
> > >>   programming language and paradigms. The company I worked at
> > >>   primarily used the Java APIs. The output needs to be a technical
> > >>   report describing the project requirements and the design process I
> > >>   took to engineer the solution for those requirements. In particular,
> > >>   it cannot just be a series of haphazard patches.
> > >>   2. How can I get started with contributing to Spark?
> > >>   3. Is there a high-level UML or some other design specification for
> > >>   the Spark architecture?
> > >>
> > >> Thanks! I hope to be of some help =)
> > >>
> > >> -Matt Cheah
> >
>
