spark-user mailing list archives

From Hollin Wilkins <hol...@combust.ml>
Subject Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext
Date Mon, 06 Feb 2017 14:08:22 GMT
Hi All -


We received a number of great questions and have added responses to them on
the MLeap documentation page, in the FAQ section
<http://mleap-docs.combust.ml/faq.html>. A condensed version is also
included at the bottom of this email.


We appreciate the interest and the discussion around MLeap - going from
research to production has been a key focus for us for a while and we are
very passionate about this topic. We welcome community feedback and support
(code, ideas, use-cases) and aim to make taking ML Pipelines to production
a pleasant experience.


Best,

Hollin and Mikhail


--------------------------


FAQs:


Does MLeap Support Custom Transformers?

Absolutely - our goal is to make writing custom transformers easy. For
documentation on writing and contributing custom transformers, see the
Custom Transformers
<http://mleap-docs.combust.ml/mleap-runtime/custom-transformer.html> page.

What is MLeap Runtime’s Inference Performance?

MLeap is optimized to execute entire pipelines in microseconds (a
microsecond is 1/1000th of a millisecond). We provide a benchmarking library
<https://github.com/combust/mleap/tree/master/mleap-benchmark> as part of
MLeap that reports the following response times for a pipeline comprised of
vector assemblers, standard scalers, string indexers, and one-hot encoders,
followed by either a linear regression or a random forest:

   - Linear regression: 6.2 microseconds with MLeap vs. 106 milliseconds
     with Spark using LocalRelation
   - Random forest: 6.8 microseconds with MLeap vs. 101 milliseconds with
     Spark using LocalRelation
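For reference, the MLeap half of this benchmark amounts to loading a
serialized bundle once and then calling transform on a leap frame per
request. A rough sketch against the 0.5.x-era Scala API (the bundle path,
schema, and row values are placeholders, and exact package and method names
may differ between releases):

```scala
import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import ml.combust.mleap.runtime.{DefaultLeapFrame, LocalDataset, Row}
import ml.combust.mleap.runtime.types.{StructField, StructType, StringType}
import resource._

// Load a previously serialized pipeline bundle (placeholder path).
val bundle = (for (bf <- managed(BundleFile("jar:file:/tmp/pipeline.zip"))) yield {
  bf.loadMleapBundle().get
}).tried.get

// Build a single-row leap frame; the schema here is illustrative and must
// match whatever input schema your own pipeline expects.
val schema = StructType(StructField("text", StringType())).get
val frame = DefaultLeapFrame(schema, LocalDataset(Row("some text")))

// Executes the entire pipeline on the JVM -- no SparkContext involved.
val scored = bundle.root.transform(frame).get
```

Timing just the `transform` call is what yields the microsecond-level
numbers above.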


What Should Be Considered When Making a Decision Between Using MLeap and
other Serialization/Execution Frameworks?

MLeap serialization is built with the following goals and requirements in
mind:


   - It should be easy for developers to add custom transformers in Scala
     and Java (we are adding Python and C support as well)
   - The serialization format should be flexible and meet state-of-the-art
     performance requirements. MLeap serializes to Protobuf 3, making
     scalable deployment and execution of large pipelines and models like
     random forests and neural nets possible
   - Serialization should be optimized for ML transformers and pipelines
   - Serialization should be accessible from all environments and platforms,
     including low-level languages like C, C++ and Rust
   - It should provide a common serialization framework for Spark,
     scikit-learn, and TensorFlow transformers


Is MLeap Ready For Production?

Yes - MLeap is used in a number of production environments today. The MLeap
0.5.0 release provides a stable serialization and execution format for ML
pipelines. Version 1.0.0 will guarantee backwards compatibility.

Why Not Use a SparkContext With a LocalRelation DataFrame?

APIs relying on a SparkContext can be optimized to process queries in
roughly 100ms; if that meets your requirements, then LocalRelation is a
possible solution. However, MLeap's use cases require sub-20ms, and in some
cases sub-millisecond, response times.
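For context, the LocalRelation approach being compared here scores a request
by building a one-row local DataFrame and pushing it through the fitted
PipelineModel; Catalyst planning and job execution on that tiny DataFrame
are what dominate the ~100ms figure. A minimal sketch (the path and column
names are placeholders):

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("scoring")
  .getOrCreate()
import spark.implicits._

// Placeholder path to a previously saved, fitted pipeline.
val model = PipelineModel.load("/tmp/spark-pipeline")

// A one-row DataFrame backed by a LocalRelation; every request pays the
// full query planning and execution overhead before the model logic runs.
val input = Seq(("some text", 42.0)).toDF("text", "numericFeature")
val prediction = model.transform(input).select("prediction").head().getDouble(0)
```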

Is Spark MLlib Supported?

Spark ML Pipelines already support many of the same transformers and models
that are part of MLlib. In addition, we offer a wrapper around MLlib's
SupportVectorMachine in our mleap-spark-extension module. If you find that
something available in MLlib is missing from Spark ML, please let us know or
contribute your own wrapper to MLeap.

Does MLeap Work With Spark Streaming?

Yes - we will add a tutorial on that in the next few weeks.
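The reason this works without special support: the MLeap runtime is a plain
JVM library with no SparkContext dependency, so scoring inside a streaming
job is just a per-record function call. A hypothetical sketch --
`mleapPipeline` (a transformer loaded from a bundle) and `toLeapFrame` (a
user-written event-to-frame conversion) are assumptions, not MLeap APIs:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setAppName("mleap-streaming-scoring"), Seconds(1))
val events = ssc.socketTextStream("localhost", 9999)

// `mleapPipeline` and `toLeapFrame` are hypothetical: the first is an MLeap
// transformer deserialized from a bundle on the executor, the second turns
// a raw event into a single-row leap frame. The transform call itself is a
// pure JVM call and needs no Spark machinery.
val predictions = events.map { event =>
  mleapPipeline.transform(toLeapFrame(event)).get
}
predictions.print()

ssc.start()
ssc.awaitTermination()
```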

How Does TensorFlow Integration Work?

TensorFlow integration works by using the official TensorFlow SWIG wrappers.
We may eventually change this to use JavaCPP bindings, or even take an
Erlang-inspired approach and run a separate TensorFlow process for executing
TensorFlow graphs. However we end up implementing it, the interface will
stay the same: you will always be able to transform your leap frames with
the TensorflowTransformer.

When Will Scikit-Learn Be Supported?

Scikit-learn support is currently in beta and we are working to support the
following functionality in the initial release in early March:

   - Support for all scikit-learn transformers that have a corresponding
     Spark transformer
   - Both serialization and deserialization of MLeap Bundles
   - Basic pandas support: group-by aggregations and joins


How Can I Contribute?

   - Contribute an Estimator/Transformer from Spark or your own custom
     transformer
   - Write documentation
   - Write a tutorial/walkthrough for an interesting ML problem
   - Use MLeap at your company and tell us what you think
   - Talk with us on Gitter <https://gitter.im/combust/mleap>


On Mon, Feb 6, 2017 at 12:01 AM, Aseem Bansal <asmbansal2@gmail.com> wrote:

> I agree with you that this is needed. There is a JIRA
> https://issues.apache.org/jira/browse/SPARK-10413
>
> On Sun, Feb 5, 2017 at 11:21 PM, Debasish Das <debasish.das83@gmail.com>
> wrote:
>
>> Hi Aseem,
>>
>> Due to production deploy, we did not upgrade to 2.0 but that's critical
>> item on our list.
>>
>> For exposing models out of PipelineModel, let me look into the ML
>> tasks...we should add it since dataframe should not be must for model
>> scoring...many times model are scored on api or streaming paths which don't
>> have micro batching involved...data directly lands from http or kafka/msg
>> queues...for such cases raw access to ML model is essential similar to
>> mllib model access...
>>
>> Thanks.
>> Deb
>> On Feb 4, 2017 9:58 PM, "Aseem Bansal" <asmbansal2@gmail.com> wrote:
>>
>>> @Debasish
>>>
>>> I see that the spark version being used in the project that you
>>> mentioned is 1.6.0. I would suggest that you take a look at some blogs
>>> related to Spark 2.0 Pipelines, Models in new ml package. The new ml
>>> package's API as of latest Spark 2.1.0 release has no way to call predict
>>> on single vector. There is no API exposed. It is WIP but not yet released.
>>>
>>> On Sat, Feb 4, 2017 at 11:07 PM, Debasish Das <debasish.das83@gmail.com>
>>> wrote:
>>>
>>>> If we expose an API to access the raw models out of PipelineModel can't
>>>> we call predict directly on it from an API ? Is there a task open to expose
>>>> the model out of PipelineModel so that predict can be called on it....there
>>>> is no dependency of spark context in ml model...
>>>> On Feb 4, 2017 9:11 AM, "Aseem Bansal" <asmbansal2@gmail.com> wrote:
>>>>
>>>>>
>>>>>    - In Spark 2.0 there is a class called PipelineModel. I know that
>>>>>    the title says pipeline but it is actually talking about PipelineModel
>>>>>    trained via using a Pipeline.
>>>>>    - Why PipelineModel instead of pipeline? Because usually there is
>>>>>    a series of stuff that needs to be done when doing ML which warrants
>>>>>    an ordered sequence of operations. Read the new spark ml docs or one
>>>>>    of the databricks blogs related to spark pipelines. If you have used
>>>>>    python's sklearn library the concept is inspired from there.
>>>>>    - "once model is deserialized as ml model from the store of choice
>>>>>    within ms" - The timing of loading the model was not what I was
>>>>>    referring to when I was talking about timing.
>>>>>    - "it can be used on incoming features to score through
>>>>>    spark.ml.Model predict API". The predict API is in the old mllib
>>>>>    package not the new ml package.
>>>>>    - "why r we using dataframe and not the ML model directly from
>>>>>    API" - Because as of now the new ml package does not have the
>>>>>    direct API.
>>>>>
>>>>>
>>>>> On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das <
>>>>> debasish.das83@gmail.com> wrote:
>>>>>
>>>>>> I am not sure why I will use pipeline to do scoring...idea is to
>>>>>> build a model, use model ser/deser feature to put it in the row or
>>>>>> column store of choice and provide a api access to the model...we
>>>>>> support these primitives in github.com/Verizon/trapezium...the api
>>>>>> has access to spark context in local or distributed mode...once model
>>>>>> is deserialized as ml model from the store of choice within ms, it
>>>>>> can be used on incoming features to score through spark.ml.Model
>>>>>> predict API...I am not clear on 2200x speedup...why r we using
>>>>>> dataframe and not the ML model directly from API ?
>>>>>> On Feb 4, 2017 7:52 AM, "Aseem Bansal" <asmbansal2@gmail.com> wrote:
>>>>>>
>>>>>>> Does this support Java 7?
>>>>>>> What is your timezone in case someone wanted to talk?
>>>>>>>
>>>>>>> On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins <hollin@combust.ml>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Aseem,
>>>>>>>>
>>>>>>>> We have built pipelines that execute several string indexers, one
>>>>>>>> hot encoders, scaling, and a random forest or linear regression at
>>>>>>>> the end. Execution time for the linear regression was on the order
>>>>>>>> of 11 microseconds, a bit longer for random forest. This can be
>>>>>>>> further optimized by using row-based transformations, if your
>>>>>>>> pipeline is simple, to around 2-3 microseconds. The pipeline
>>>>>>>> operated on roughly 12 input features, and by the time all the
>>>>>>>> processing was done, we had somewhere around 1000 features or so
>>>>>>>> going into the linear regression after one hot encoding and
>>>>>>>> everything else.
>>>>>>>>
>>>>>>>> Hope this helps,
>>>>>>>> Hollin
>>>>>>>>
>>>>>>>> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal <asmbansal2@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Does this support Java 7?
>>>>>>>>>
>>>>>>>>> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal <asmbansal2@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Is computational time for predictions on the order of a few
>>>>>>>>>> milliseconds (< 10 ms) like the old mllib library?
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins <
>>>>>>>>>> hollin@combust.ml> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Some of you may have seen Mikhail and I talk at Spark/Hadoop
>>>>>>>>>>> Summits about MLeap and how you can use it to build production
>>>>>>>>>>> services from your Spark-trained ML pipelines. MLeap is an
>>>>>>>>>>> open-source technology that allows Data Scientists and Engineers
>>>>>>>>>>> to deploy Spark-trained ML Pipelines and Models to a scoring
>>>>>>>>>>> engine instantly. The MLeap execution engine has no dependencies
>>>>>>>>>>> on a Spark context and the serialization format is entirely based
>>>>>>>>>>> on Protobuf 3 and JSON.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The recent 0.5.0 release provides serialization and inference
>>>>>>>>>>> support for close to 100% of Spark transformers (we don’t yet
>>>>>>>>>>> support ALS and LDA).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> MLeap is open-source, take a look at our GitHub page:
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/combust/mleap
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Or join the conversation on Gitter:
>>>>>>>>>>>
>>>>>>>>>>> https://gitter.im/combust/mleap
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> We have a set of documentation to help get you started here:
>>>>>>>>>>>
>>>>>>>>>>> http://mleap-docs.combust.ml/
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> We even have a set of demos, for training ML Pipelines and
>>>>>>>>>>> linear, logistic and random forest models:
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/combust/mleap-demo
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Check out our latest MLeap-serving Docker image, which allows
>>>>>>>>>>> you to expose a REST interface to your Spark ML pipeline models:
>>>>>>>>>>>
>>>>>>>>>>> http://mleap-docs.combust.ml/mleap-serving/
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Several companies are using MLeap in production and even more
>>>>>>>>>>> are currently evaluating it. Take a look and tell us what you
>>>>>>>>>>> think! We hope to talk with you soon and welcome
>>>>>>>>>>> feedback/suggestions!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Sincerely,
>>>>>>>>>>>
>>>>>>>>>>> Hollin and Mikhail
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>
>
