spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Middleware-wrappers for Spark
Date Tue, 17 Jan 2017 17:38:53 GMT
On Tue, Jan 17, 2017 at 4:49 PM Rick Moritz <rahvin@gmail.com> wrote:

> * Oryx2 - This was more focused on a particular issue, and looked to be a
> very nice framework for deploying real-time analytics --- but again, no
> real traction. In fact, I've heard of PoCs being done by/for Cloudera, to
> demo Lambda-Architectures with Spark, and this was not showcased.
>

This one is not like the others IMHO (I'm mostly the author). It definitely
doesn't serve access to Spark jobs. It's ML-focused, hence, much more
narrowly applicable than what 'lambda' would encompass. In practice it's
used as an application, for recommendations, only. Niche, but does what it
does well. It isn't used as a general platform by even Cloudera. It's
framed as a reference architecture for an app.


>
> * Livy - Although Livy still appears to live, I'm not really seeing the
> progress, that I anticipated after first hearing about it at the 2015 Spark
> Summit Europe. Maybe it's because the documentation isn't quite there yet,
> maybe it's because features are missing -- somehow from my last look at it,
> it's not enterprise-ready quite yet, while offering a feature-set that
> should be driving enterprise adoption.
>

This and Job Server (and about 3 other job server tools) do mostly the same
thing. Livy is a Cloudera project that began to support a notebook-like
tool in Hue. I think this didn't really go live because of grander plans
that should emerge from the Sense acquisition. Livy's still definitely
active, and I think it was created instead of adopting another tool at the
time because it was deemed easier to build in the enterprise-y requirements
like security from scratch. Go figure. I don't know how much Livy is really
meant to be a general tool pushed for general consumption. It has existed
to support CDH-related notebook tools, as I understand it, to date.

(Even though I'm @cloudera.com I don't interface directly with either of
the above-mentioned teams so take my comment with a little grain of salt)


>
> * Mist - Just discovered it today, thinking, "great, ANOTHER middleware"
> and prompting this post. It looks quite fully featured, but can it succeed?
> On the plus side, it's linked to a small, focused business, on the down
> side it's linked to a small, focused business. Positive, since that drives
> development along nicely; negative, since it inhibits adoption in the
> enterprise space.
>

Less like the others, if it's a model-serving tool. It sounds like
OpenScoring in some ways. But yes, it does seem like it tries to expose access
to Spark jobs. I hadn't heard of it until you mentioned it.



> Now, with that said - why did these products not gain bigger traction? Is
> it because Spark isn't quite ready yet? Is it because of a missed marketing
> opportunity?
>

I have a collection of guesses. First, you may be surprised how early it is
for most companies to be using Spark in a basic way, let alone with
'middleware'.

Maybe none are all that mature? None that I know of have vendor backing (Livy
is not formally supported by Cloudera, even). Maybe the small market is
fragmented?

Not all of these things do quite the same thing. The space of 'apps' and
'middleware' is big.

Not all (many?) use cases require long-running Spark jobs. That is what
these tools provide. It's not what Spark was built for. Using it this way
has rough edges. I actually think it's this: a relative lack of demand.


>
> And on another note: Should Spark integrate such a wrapper "by default"?
> It's a step further on from the SparkSQL Thrift interface, towards offering
> not just programming API's, but service-APIs. Considering that there are so
> many different interpretations of how this should be solved, bundling the
> effort into a default-implementation could be beneficial. On the other
> hand, feature creep of this magnitude probably isn't desirable.
>

I don't think so, mostly because there's no strong reason to bless one and
reject the others, and because I still think this isn't something Spark was
built for. Spark is already a very large project and there has to be some
boundary where it ends and the ecosystem begins. There are dis-economies
(?) of scale for OSS projects.



> Also, I'm looking at this with my enterprise-glasses on: So fine-grained
> user authorization and authentication features are very important, as are
> consistency and resiliency features. Since long-running interactive
> Spark-jobs are still a mixed bag stability-wise, this
>

Security integration is the big issue, as I understand it. I don't think any
of these tools can fully guarantee resiliency and consistency in the general
case. A Spark job can have only one driver, and there is no HA. Resource
managers already manage restarting failed drivers. I don't know if that's
the issue.


> layer of middleware should provide a necessary buffer between crashes of
> the driver program, and serving results.
> Ecosystem support is also a must - why aren't there Tableau connectors for
> (some of) these APIs? [Because they're too obscure...]
>

(It's much easier to plug Tableau into Impala via ODBC to do Tableau-like
things on the same Parquet-formatted data you'd access in Spark.)
