spark-dev mailing list archives

From Felix Cheung <>
Subject Re: Revisiting Online serving of Spark models?
Date Mon, 21 May 2018 03:11:41 GMT
Specifically, I’d like to bring part of the discussion to Model and PipelineModel, and the various
ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker
to reusing trained models outside of Spark for online serving.

What’s the next step? Would folks be interested in getting together to discuss/get some

From: Felix Cheung <>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <>, Joseph Bradley <>
Cc: dev <>

Huge +1 on this!

From: <> on behalf of Holden Karau <>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?

On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <<>>
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of MLlib models
which could be deployed without the big Spark JARs and without a SparkContext or SparkSession.
 There are related commercial offerings like this : ) but the overhead of maintaining those
offerings is pretty high.  Building good APIs within MLlib to avoid copying logic across libraries
will be well worth it.

We've talked about this need at Databricks and have also been syncing with the creators of
MLeap.  It'd be great to get this functionality into Spark itself.  Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a Row to the current
Models.  Instead, it would be ideal to have local, lightweight versions of models in mllib-local,
outside of the main mllib package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize elements of Spark
SQL, particularly Rows and Types, which could be moved into a local sql package.
* This architecture may require some awkward APIs currently to have model prediction logic
in mllib-local, local model classes in mllib-local, and regular (DataFrame-friendly) model
classes in mllib.  We might find it helpful to break some DeveloperApis in Spark 3.0 to facilitate
this architecture while making it feasible for 3rd party developers to extend MLlib APIs (especially
in Java).
I agree this could be interesting, and feed into the other discussion around when (or if)
we should be considering Spark 3.0
I _think_ we could probably do it with optional traits people could mix in to avoid breaking
the current APIs but I could be wrong on that point.
* It could also be worth discussing local DataFrames.  They might not be as important as per-Row
transformations, but they would be helpful for batching for higher throughput.
That could be interesting as well.

I'll be interested to hear others' thoughts too!
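
The layering described in the bullets above (prediction logic in one place, a lightweight local model, and a batch-oriented model that reuses both) could look roughly like the sketch below. This is only an illustration under stated assumptions: every name here is hypothetical, nothing in it is an existing Spark or mllib-local API, and a plain Seq stands in for a DataFrame so the example has no Spark dependency.

```scala
// Hypothetical layering sketch; all names are illustrative, not Spark APIs.

// Layer 1: prediction logic with no Spark dependency
// (the part the thread suggests could live in mllib-local).
object ToyLogisticLogic {
  def probability(
      weights: Array[Double],
      intercept: Double,
      features: Array[Double]): Double = {
    require(features.length == weights.length, "feature/weight length mismatch")
    val margin = intercept + weights.zip(features).map { case (w, x) => w * x }.sum
    1.0 / (1.0 + math.exp(-margin))
  }
}

// Layer 2: lightweight local model for per-row serving.
final case class ToyLocalLogisticModel(weights: Array[Double], intercept: Double) {
  def predictProbability(features: Array[Double]): Double =
    ToyLogisticLogic.probability(weights, intercept, features)
}

// Layer 3: stand-in for the regular (DataFrame-friendly) model.
// Assumption: Seq[Array[Double]] plays the role of a DataFrame here.
final class ToyBatchLogisticModel(local: ToyLocalLogisticModel) {
  def transform(rows: Seq[Array[Double]]): Seq[Double] =
    rows.map(local.predictProbability)
}

val toyLocal = new ToyLocalLogisticModel(Array(1.0, -1.0), 0.0)
val toyBatch = new ToyBatchLogisticModel(toyLocal)
val oneRow = toyLocal.predictProbability(Array(2.0, 2.0))
val manyRows = toyBatch.transform(Seq(Array(2.0, 2.0)))
```

The point of the sketch is only that both model classes delegate to the same logic, so there is nothing to copy across libraries.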


On Wed, May 9, 2018 at 7:18 AM, Holden Karau <<>>
Hi y'all,

With the renewed interest in ML in Apache Spark, now seems like as good a time as any to revisit
the online serving situation in Spark ML. DB & others have done some excellent work
moving a lot of the necessary tools into a local linear algebra package that doesn't depend
on having a SparkContext.

There are a few different commercial and non-commercial solutions around this, but currently
our individual transform/predict methods are private, so they either need to copy or re-implement
them (or put themselves in org.apache.spark) to access them. How would folks feel about adding
a new trait for ML pipeline stages that exposes transformation of single-element inputs
(or local collections) and that could be optionally implemented by stages which support this?
That way we can have less copy and paste code possibly getting out of sync with our model
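
As a rough illustration of the optional trait proposed above, here is a minimal sketch. The names LocalTransformer, transformLocal, transformLocalBatch and ToyLinearModel are all hypothetical and do not exist in Spark; the toy model is only a stand-in for a fitted stage.

```scala
// Hypothetical sketch: none of these names are part of Spark's API.

// Optional capability a pipeline stage could mix in if it can transform
// a single element (or a local collection) without a SparkContext.
trait LocalTransformer[In, Out] {
  def transformLocal(input: In): Out

  // Default path for local collections; a stage could override this
  // with a vectorized implementation.
  def transformLocalBatch(inputs: Seq[In]): Seq[Out] =
    inputs.map(transformLocal)
}

// Toy stand-in for a fitted linear model that opts in to the trait.
final class ToyLinearModel(weights: Array[Double], intercept: Double)
    extends LocalTransformer[Array[Double], Double] {

  override def transformLocal(features: Array[Double]): Double = {
    require(features.length == weights.length, "feature/weight length mismatch")
    var sum = intercept
    var i = 0
    while (i < weights.length) {
      sum += weights(i) * features(i)
      i += 1
    }
    sum
  }
}

val toyModel = new ToyLinearModel(Array(2.0, -1.0), 0.5)
val single = toyModel.transformLocal(Array(1.0, 3.0))
val batch = toyModel.transformLocalBatch(Seq(Array(0.0, 0.0), Array(1.0, 1.0)))
```

Stages that cannot support local transformation would simply not mix the trait in, which is what keeps it optional and avoids breaking the current APIs.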

I think continuing to have on-line serving grow in different projects is probably the right
path forward (folks have different needs), but I'd love to see us make it simpler for other
projects to build reliable serving tools.

I realize this maybe puts some of the folks in an awkward position with their own commercial
offerings, but hopefully if we make it easier for everyone the commercial vendors can benefit
as well.


Holden :)



Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.


