spark-dev mailing list archives

From Holden Karau <>
Subject Re: Revisiting Online serving of Spark models?
Date Thu, 10 May 2018 16:39:26 GMT
On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <> wrote:

> Thanks for bringing this up Holden!  I'm a strong supporter of this.
Awesome! I'm glad other folks think something like this belongs in Spark.

> This was one of the original goals for mllib-local: to have local versions
> of MLlib models which could be deployed without the big Spark JARs and
> without a SparkContext or SparkSession.  There are related commercial
> offerings like this : ) but the overhead of maintaining those offerings is
> pretty high.  Building good APIs within MLlib to avoid copying logic across
> libraries will be well worth it.
> We've talked about this need at Databricks and have also been syncing with
> the creators of MLeap.  It'd be great to get this functionality into Spark
> itself.  Some thoughts:
> * It'd be valuable to have this go beyond adding transform() methods
> taking a Row to the current Models.  Instead, it would be ideal to have
> local, lightweight versions of models in mllib-local, outside of the main
> mllib package (for easier deployment with smaller & fewer dependencies).
> * Supporting Pipelines is important.  For this, it would be ideal to
> utilize elements of Spark SQL, particularly Rows and Types, which could be
> moved into a local sql package.
> * This architecture may currently require some awkward APIs in order to have model
> prediction logic in mllib-local, local model classes in mllib-local, and
> regular (DataFrame-friendly) model classes in mllib.  We might find it
> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
> architecture while making it feasible for 3rd party developers to extend
> MLlib APIs (especially in Java).
I agree this could be interesting, and it could feed into the other
discussion around when (or if) we should be considering Spark 3.0.
I _think_ we could probably do it with optional traits that people could
mix in, which would avoid breaking the current APIs, but I could be wrong
on that point.
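
A rough sketch of the optional-trait idea, written in Python for brevity (in
Spark itself this would presumably be a Scala trait). Every name here
(Transformer as a stand-in base class, LocalTransform, transform_row,
supports_local) is made up for illustration and is not an existing Spark API.
The point is only that a stage can opt in to single-row prediction without
the base stage API changing, and that both paths share one prediction
routine so they cannot drift apart.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Mapping, Sequence

Row = Mapping[str, Any]  # assumption: a row is just a name -> value mapping


class Transformer(ABC):
    """Stand-in for the existing stage API; it is left unchanged."""

    @abstractmethod
    def transform(self, dataset: Sequence[Row]) -> List[Dict[str, Any]]:
        ...


class LocalTransform(ABC):
    """Hypothetical optional mixin for stages that can score one row."""

    @abstractmethod
    def transform_row(self, row: Row) -> Dict[str, Any]:
        ...


class LocalLinearModel(Transformer, LocalTransform):
    """A stage that opts in to the mixin."""

    def __init__(self, weights: Sequence[float], intercept: float) -> None:
        self.weights = list(weights)
        self.intercept = intercept

    def _predict(self, features: Sequence[float]) -> float:
        # Single source of truth for the prediction logic.
        return sum(w * x for w, x in zip(self.weights, features)) + self.intercept

    def transform_row(self, row: Row) -> Dict[str, Any]:
        out = dict(row)
        out["prediction"] = self._predict(row["features"])
        return out

    def transform(self, dataset: Sequence[Row]) -> List[Dict[str, Any]]:
        # The dataset path reuses the same per-row logic.
        return [self.transform_row(r) for r in dataset]


class LegacyStage(Transformer):
    """A stage that does not opt in; it keeps working as before."""

    def transform(self, dataset: Sequence[Row]) -> List[Dict[str, Any]]:
        return [dict(r) for r in dataset]


def supports_local(stage: Transformer) -> bool:
    """Serving code can check for the capability before using it."""
    return isinstance(stage, LocalTransform)


model = LocalLinearModel(weights=[2.0, 1.0], intercept=0.5)
scored = model.transform_row({"features": [1.0, 3.0]})
print(scored["prediction"])  # 5.5
```

A serving library would call supports_local(stage) for each stage and fall
back to its own handling (or refuse to load the pipeline) when a stage has
not implemented the mixin.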

> * It could also be worth discussing local DataFrames.  They might not be
> as important as per-Row transformations, but they would be helpful for
> batching for higher throughput.
That could be interesting as well.
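
To make the batching idea a bit more concrete, here is a minimal sketch in
Python, with a plain list of dicts standing in for a "local DataFrame". The
names (batches, LocalLinearModel, transform_batch) are hypothetical and not
Spark APIs. It only shows the shape of the API: requests are grouped into
fixed-size chunks and each chunk is scored in one call. Any real throughput
gain would come from a vectorized implementation behind transform_batch,
which this sketch does not attempt.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, Sequence

Row = Dict[str, Any]  # assumption: a row is a name -> value dict


def batches(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Group an incoming stream of rows into lists of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


class LocalLinearModel:
    """Hypothetical local model exposing a per-batch entry point."""

    def __init__(self, weights: Sequence[float], intercept: float) -> None:
        self.weights = list(weights)
        self.intercept = intercept

    def transform_batch(self, batch: List[Row]) -> List[Row]:
        # One call per batch; a real implementation could stack the feature
        # vectors and score them together instead of looping.
        scored = []
        for row in batch:
            value = sum(w * x for w, x in zip(self.weights, row["features"]))
            scored.append({**row, "prediction": value + self.intercept})
        return scored


model = LocalLinearModel(weights=[2.0], intercept=1.0)
requests = [{"features": [float(i)]} for i in range(5)]

results = []
batch_sizes = []
for batch in batches(requests, size=2):
    batch_sizes.append(len(batch))
    results.extend(model.transform_batch(batch))

print(batch_sizes)                         # [2, 2, 1]
print([r["prediction"] for r in results])  # [1.0, 3.0, 5.0, 7.0, 9.0]
```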

> I'll be interested to hear others' thoughts too!
> Joseph
> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <> wrote:
>> Hi y'all,
>> With the renewed interest in ML in Apache Spark, now seems like as good a
>> time as any to revisit the online serving situation in Spark ML. DB &
>> others have done some excellent work moving a lot of the necessary
>> tools into a local linear algebra package that doesn't depend on having a
>> SparkContext.
>> There are a few different commercial and non-commercial solutions around
>> this, but currently our individual transform/predict methods are private, so
>> those projects need to either copy or re-implement them, or put themselves
>> in the org.apache.spark package to access them. How would folks feel about
>> adding a new trait that ML pipeline stages could expose to transform
>> single-element inputs (or local collections), optionally implemented by the
>> stages which support this? That way we can have less copy-and-paste code
>> possibly getting out of sync with our model training.
>> I think continuing to have online serving grow in different projects is
>> probably the right path forward (folks have different needs), but I'd love
>> to see us make it simpler for other projects to build reliable serving
>> tools.
>> I realize this may put some of the folks in an awkward position with
>> their own commercial offerings, but hopefully if we make it easier for
>> everyone the commercial vendors can benefit as well.
>> Cheers,
>> Holden :)
> --
> Joseph Bradley
> Software Engineer - Machine Learning
> Databricks, Inc.

