spark-user mailing list archives

From Nick Pentreath <nick.pentre...@gmail.com>
Subject Re: Deploying ML Pipeline Model
Date Tue, 05 Jul 2016 09:42:47 GMT
It all depends on your latency requirements and volume. Hundreds of queries
per minute, with an acceptable latency of up to a few seconds? Yes, you could
use Spark for serving, especially if you're smart about caching results
(and I don't mean just Spark caching, but also caching the recommendation
results themselves, for example similar-items lists).
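
To make the result-caching point concrete, here is a minimal sketch in
Python (the language of the Flask article linked further down the thread).
Everything in it is an assumption for illustration: the TTLCache class, the
similar_items_from_spark placeholder and the 300-second TTL are hypothetical
names and values, not part of Spark or of the article.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLCache:
    """Tiny in-process cache that keeps each computed result for ttl_seconds."""

    def __init__(
        self,
        compute: Callable[[int], List[int]],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._compute = compute      # the expensive call (e.g. a Spark job)
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._entries: Dict[int, Tuple[float, List[int]]] = {}
        self.misses = 0              # how often the expensive call was made

    def get(self, key: int) -> List[int]:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh: skip the expensive call
        self.misses += 1
        value = self._compute(key)
        self._entries[key] = (now + self._ttl, value)
        return value


def similar_items_from_spark(item_id: int) -> List[int]:
    # Hypothetical stand-in for the slow path, e.g. running the pipeline
    # on a Spark cluster. Here it just returns fake neighbour ids.
    return [item_id + 1, item_id + 2, item_id + 3]


cache = TTLCache(similar_items_from_spark, ttl_seconds=300.0)
```

With something like this in front of Spark, only the first request for a
given item within the TTL window pays the Spark latency; repeat requests are
answered from memory.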

However, for many serving use cases, using a Spark cluster is too much
overhead. Bear in mind that real-world serving of many models
(recommendations, ad serving, fraud, etc.) is one component of a complex
workflow (e.g. one page request in ad tech involves tens of requests and
hops between various ad servers and exchanges). That is why the practical
latency bounds are often < 100 ms (or way, way tighter for ad serving, for
example).
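
As a rough illustration of how small a Spark-free serving path can be for a
simple text pipeline (tokenize, hash terms, apply a linear model; the same
shape as the regexTok/hashingTF/linReg stages in the saved model further
down the thread), here is a sketch in plain Python. Treat it as a toy under
stated assumptions: the function names, the CRC32-based hashing and the
weights are all hypothetical, and the hashing is NOT compatible with Spark's
HashingTF, so weights exported from a Spark model could not be dropped in
as-is.

```python
import re
import zlib
from typing import Dict, List, Sequence


def tokenize(text: str) -> List[str]:
    # Lower-case and split on runs of word characters.
    return re.findall(r"\w+", text.lower())


def hashed_term_counts(tokens: Sequence[str], num_features: int) -> Dict[int, float]:
    # Hashing trick: map each token to a bucket and count occurrences.
    # CRC32 is an arbitrary choice for this sketch, not Spark's hash.
    counts: Dict[int, float] = {}
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0.0) + 1.0
    return counts


def predict(text: str, weights: Sequence[float], intercept: float) -> float:
    # Sparse dot product of the hashed counts with the weight vector.
    features = hashed_term_counts(tokenize(text), len(weights))
    return intercept + sum(weights[i] * v for i, v in features.items())


# Hypothetical model parameters, purely for demonstration.
example_weights = [1.0] * 16
example_score = predict("spark spark serving", example_weights, 0.5)
```

Nothing here needs a cluster or a DataFrame; the hard part in practice is
exporting the fitted parameters and reproducing each transformer exactly.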


On Fri, 1 Jul 2016 at 21:59 Saurabh Sardeshpande <saurabh.ss@gmail.com>
wrote:

> Hi Nick,
>
> Thanks for the answer. Do you think an implementation like the one in this
> article is infeasible in production for, say, hundreds of queries per
> minute?
> https://www.codementor.io/spark/tutorial/building-a-web-service-with-apache-spark-flask-example-app-part2.
> The article uses Flask to define routes and Spark for evaluating requests.
>
> Regards,
> Saurabh
>
> On Fri, Jul 1, 2016 at 10:47 AM, Nick Pentreath <nick.pentreath@gmail.com>
> wrote:
>
>> Generally there are 2 ways to use a trained pipeline model - (offline)
>> batch scoring, and real-time online scoring.
>>
>> For batch (or even "mini-batch", e.g. on Spark Streaming data), loading
>> the model back into Spark and feeding new data through the pipeline for
>> prediction certainly works just fine. This is essentially what is
>> supported in 1.6 (with more or less full coverage in 2.0). For large batch
>> cases this can be quite efficient.
>>
>> However, for real-time use cases the required latency is usually fairly
>> low, on the order of a few ms to a few hundred ms per request (some
>> examples include recommendations, ad serving, fraud detection, etc.).
>>
>> In these cases, using Spark has 2 issues: (1) latency for prediction on
>> the pipeline, which is based on DataFrames and therefore distributed
>> execution, is usually fairly high "per request"; (2) this requires pulling
>> in all of Spark for your real-time serving layer (or running a full Spark
>> cluster), which is usually overkill, since all you really need for
>> serving is a bit of linear algebra and some basic transformations.
>>
>> So for now, unfortunately there is not much in the way of options for
>> exporting your pipelines and serving them outside of Spark - the
>> JPMML-based project mentioned on this thread is one option. The other
>> option at this point is to write your own export functionality and your own
>> serving layer.
>>
>> There is (very initial) movement towards improving the local serving
>> possibilities (see https://issues.apache.org/jira/browse/SPARK-13944 which
>> was the "first step" in this process).
>>
>> On Fri, 1 Jul 2016 at 19:24 Jacek Laskowski <jacek@japila.pl> wrote:
>>
>>> Hi Rishabh,
>>>
>>> Just today I had a similar conversation about how to do an ML Pipeline
>>> deployment and couldn't really answer this question, mostly because I
>>> don't really understand the use case.
>>>
>>> What would you expect from ML Pipeline model deployment? You can save
>>> your model to a directory with model.write.overwrite.save("model_v1"):
>>>
>>> model_v1
>>> |-- metadata
>>> |   |-- _SUCCESS
>>> |   `-- part-00000
>>> `-- stages
>>>     |-- 0_regexTok_b4265099cc1c
>>>     |   `-- metadata
>>>     |       |-- _SUCCESS
>>>     |       `-- part-00000
>>>     |-- 1_hashingTF_8de997cf54ba
>>>     |   `-- metadata
>>>     |       |-- _SUCCESS
>>>     |       `-- part-00000
>>>     `-- 2_linReg_3942a71d2c0e
>>>         |-- data
>>>         |   |-- _SUCCESS
>>>         |   |-- _common_metadata
>>>         |   |-- _metadata
>>>         |   `-- part-r-00000-2096c55a-d654-42b2-90d3-5a310101cba5.gz.parquet
>>>         `-- metadata
>>>             |-- _SUCCESS
>>>             `-- part-00000
>>>
>>> 9 directories, 12 files
>>>
>>> What would you like to have outside SparkContext? What's wrong with
>>> using Spark? Just curious, and hoping to understand the use case better.
>>> Thanks.
>>>
>>> Regards,
>>> Jacek Laskowski
>>> ----
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>>
>>>
>>> On Fri, Jul 1, 2016 at 12:54 PM, Rishabh Bhardwaj <rbnext29@gmail.com>
>>> wrote:
>>> > Hi All,
>>> >
>>> > I am looking for ways to deploy an ML Pipeline model in production.
>>> > Spark has already proved to be one of the best frameworks for model
>>> > training and creation, but once the ML Pipeline model is ready, how can
>>> > I deploy it outside a Spark context?
>>> > MLlib models have a toPMML method, but today a Pipeline model cannot be
>>> > saved to PMML. There are some frameworks like MLeap which are trying to
>>> > abstract the Pipeline Model and provide ML Pipeline Model deployment
>>> > outside a Spark context, but currently they lack most of the ML
>>> > transformers and estimators.
>>> > I am looking for related work going on in this area.
>>> > Any pointers will be helpful.
>>> >
>>> > Thanks,
>>> > Rishabh.
>>>
