Hi Nick,

Thanks for the answer. Do you think an implementation like the one in this article would be infeasible in production for, say, hundreds of queries per minute? https://www.codementor.io/spark/tutorial/building-a-web-service-with-apache-spark-flask-example-app-part2 The article uses Flask to define routes and Spark to evaluate requests.


On Fri, Jul 1, 2016 at 10:47 AM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
Generally there are 2 ways to use a trained pipeline model - (offline) batch scoring, and real-time online scoring.

For batch (or even "mini-batch", e.g. on Spark Streaming data), loading the model back in Spark and feeding new data through the pipeline for prediction certainly works fine. This is essentially what is supported in 1.6 (with more or less full coverage in 2.0). For large batches this can be quite efficient.

However, real-time use cases usually require fairly low latency, on the order of a few milliseconds to a few hundred milliseconds per request (examples include recommendations, ad serving, and fraud detection).

In these cases, using Spark has two issues: (1) prediction latency on the pipeline, which is based on DataFrames and therefore distributed execution, is usually fairly high per request; (2) it requires pulling all of Spark into your real-time serving layer (or running a full Spark cluster), which is usually overkill, since all you really need for serving is a bit of linear algebra and some basic transformations.

So for now, unfortunately, there are not many options for exporting your pipelines and serving them outside of Spark. The JPMML-based project mentioned in this thread is one option. The other option at this point is to write your own export functionality and your own serving layer.
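
To make the "own export plus own serving layer" option concrete, here is a minimal sketch in plain Python of what scoring without Spark could look like for a tokenizer, hashing term-frequency, linear regression pipeline. Everything in it is an assumption for illustration: the function names, the JSON payload shape, and NUM_FEATURES are hypothetical, the tokenizer pattern may not match your pipeline's settings, and zlib.crc32 is only a stand-in hash. It is not the hash Spark's HashingTF uses, so coefficients trained in Spark would only line up if the serving side reproduced Spark's hashing exactly.

```python
import json
import re
import zlib

NUM_FEATURES = 16  # hypothetical; must equal the feature count used in training


def tokenize(text):
    # Lowercase, split on runs of non-word characters, drop empty tokens.
    return [t for t in re.split(r"\W+", text.lower()) if t]


def hashing_tf(tokens, num_features=NUM_FEATURES):
    # Sparse term-frequency vector as {bucket index: count}.
    # ASSUMPTION: crc32 is a stand-in, not the hash Spark's HashingTF uses.
    vec = {}
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


def export_model(coefficients, intercept):
    # Hypothetical export format: a small JSON document.
    return json.dumps({"coefficients": list(coefficients), "intercept": intercept})


def load_model(payload):
    return json.loads(payload)


def predict(model, text):
    # Linear regression: intercept + dot(coefficients, features).
    coefficients = model["coefficients"]
    features = hashing_tf(tokenize(text), len(coefficients))
    return model["intercept"] + sum(coefficients[i] * v for i, v in features.items())
```

A serving process would call load_model once at startup and then predict(model, request_text) per request. The per-request work is a regex split, a few hash computations, and a sparse dot product, which is the "bit of linear algebra and some basic transformations" described above.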

There is (very initial) movement towards improving the local serving possibilities (see https://issues.apache.org/jira/browse/SPARK-13944 which was the "first step" in this process).

On Fri, 1 Jul 2016 at 19:24 Jacek Laskowski <jacek@japila.pl> wrote:
Hi Rishabh,

Just today I had a similar conversation about how to deploy an ML
Pipeline and couldn't really answer the question, mostly because I
don't really understand the use case.

What would you expect from ML Pipeline model deployment? You can save
your model to a directory with model.write.overwrite.save("model_v1"),
which produces the following layout:

|-- metadata
|   |-- _SUCCESS
|   `-- part-00000
`-- stages
    |-- 0_regexTok_b4265099cc1c
    |   `-- metadata
    |       |-- _SUCCESS
    |       `-- part-00000
    |-- 1_hashingTF_8de997cf54ba
    |   `-- metadata
    |       |-- _SUCCESS
    |       `-- part-00000
    `-- 2_linReg_3942a71d2c0e
        |-- data
        |   |-- _SUCCESS
        |   |-- _common_metadata
        |   |-- _metadata
        |   `-- part-r-00000-2096c55a-d654-42b2-90d3-5a310101cba5.gz.parquet
        `-- metadata
            |-- _SUCCESS
            `-- part-00000

9 directories, 12 files
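
A saved layout like the one above can be inspected without a SparkContext. The sketch below is a hedged illustration using only the Python standard library. It assumes each metadata/part-* file is a text file holding one JSON object per line, and that stage directory names begin with a numeric index followed by an underscore, as in the listing. The function names are hypothetical, and the Parquet file under data/ is not read here because that would need a Parquet reader.

```python
import json
from pathlib import Path


def read_metadata(metadata_dir):
    # ASSUMPTION: part-* files contain one JSON object per line.
    for part in sorted(Path(metadata_dir).glob("part-*")):
        with part.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    return json.loads(line)
    return None  # no metadata found


def describe_pipeline(model_dir):
    # Return (stage directory name, metadata dict) pairs in stage order.
    root = Path(model_dir)
    stage_dirs = sorted(
        (p for p in (root / "stages").iterdir() if p.is_dir()),
        key=lambda p: int(p.name.split("_", 1)[0]),  # leading stage index
    )
    return [(p.name, read_metadata(p / "metadata")) for p in stage_dirs]
```

Calling describe_pipeline("model_v1") would return one entry per stage directory, which is a starting point for a hand-written exporter that maps each stage to an equivalent transformation outside Spark.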

What would you like to have outside SparkContext? What's wrong with
using Spark? Just curious, hoping to understand the use case better.

Jacek Laskowski
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Fri, Jul 1, 2016 at 12:54 PM, Rishabh Bhardwaj <rbnext29@gmail.com> wrote:
> Hi All,
> I am looking for ways to deploy an ML Pipeline model in production.
> Spark has already proved to be one of the best frameworks for model
> training and creation, but once the ML Pipeline model is ready, how can I
> deploy it outside the Spark context?
> MLlib models have a toPMML method, but today a Pipeline model cannot be
> saved to PMML. There are some frameworks like MLeap which try to abstract
> the Pipeline model and provide ML Pipeline model deployment outside the
> Spark context, but currently they lack most of the ML transformers and
> estimators.
> I am looking for related work going on in this area.
> Any pointers will be helpful.
> Thanks,
> Rishabh.
