spark-user mailing list archives

From Aseem Bansal <asmbans...@gmail.com>
Subject Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?
Date Thu, 01 Sep 2016 13:21:27 GMT
I understand from a theoretical perspective that the model itself is not
distributed, so it can be used for making predictions on a single vector or
an RDD. But speaking in terms of the APIs provided by Spark 2.0.0, when I
create a model from large data the recommended way is to use the ml
library's fit. I have the option of getting a
http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/classification/NaiveBayesModel.html
or wrapping it as a
http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/PipelineModel.html

Neither of these has a method that accepts a Vector. How do I bridge this
gap in the API from my side? Is there anything in Spark's API that I have
missed? Or do I need to extract the parameters and use another library for
predictions on a single row?
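[Editor's note for archive readers: the last option above — extracting the trained parameters and scoring locally — can be sketched without Spark at all. For a multinomial naive Bayes model, scoring a single row is one small dot product per class. The parameter values below are made up for illustration; in Spark 2.0 they would come from the fitted `NaiveBayesModel`'s `pi` (log class priors) and `theta` (log feature likelihoods) members. This is a hedged sketch, not Spark's actual implementation.]

```python
import math

# Hypothetical parameters for a 2-class, 3-feature multinomial naive
# Bayes model. In Spark these would be extracted from a fitted
# NaiveBayesModel (model.pi and model.theta); the numbers here are
# made up for illustration only.
log_prior = [math.log(0.5), math.log(0.5)]
log_theta = [
    [-0.5, -1.5, -2.0],   # log P(feature | class 0)
    [-2.0, -1.5, -0.5],   # log P(feature | class 1)
]

def predict(features):
    # Multinomial NB scoring: argmax over log P(c) + theta[c] . x,
    # one small dot product per class -- no Spark job involved.
    scores = [lp + sum(w * x for w, x in zip(row, features))
              for lp, row in zip(log_prior, log_theta)]
    return scores.index(max(scores))

print(predict([3.0, 1.0, 0.0]))  # weight on feature 0 -> class 0
print(predict([0.0, 1.0, 3.0]))  # weight on feature 2 -> class 1
```

Any library (or hand-rolled code like the above) that can do this arithmetic on a single row avoids the distributed code path entirely.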

On Thu, Sep 1, 2016 at 6:38 PM, Sean Owen <sowen@cloudera.com> wrote:

> How the model is built isn't that related to how it scores things.
> Here we're just talking about scoring. NaiveBayesModel can score a
> Vector, which is not a distributed entity. That's what you want to
> use. You do not want a whole distributed operation to score one
> record. This isn't related to the .ml vs .mllib APIs.
>
> On Thu, Sep 1, 2016 at 2:01 PM, Aseem Bansal <asmbansal2@gmail.com> wrote:
> > I understand your point.
> >
> > Is there something like a bridge? Is it possible to convert the model
> > trained using Dataset<Row> (i.e. the distributed one) to one which uses
> > vectors? In Spark 1.6 the mllib package had everything in terms of
> > vectors, and as per my understanding that should be faster. But many
> > Spark blogs say that Spark is moving towards the ml package and that
> > the mllib package will be phased out. So how can someone train using
> > huge data and then use the model on a row-by-row basis?
> >
> > Thanks for your inputs.
> >
> > On Thu, Sep 1, 2016 at 6:15 PM, Sean Owen <sowen@cloudera.com> wrote:
> >>
> >> If you're trying to score a single example by way of an RDD or
> >> Dataset, then no, it will never be that fast. It's a whole distributed
> >> operation, and while you might manage low latency for one job at a
> >> time, consider what will happen when hundreds of them are running at
> >> once. It's just huge overkill for scoring a single example (but
> >> pretty fine for higher-latency, high-throughput batch operations).
> >>
> >> However, if you're scoring a Vector locally, I can't imagine it's
> >> that slow. It does some linear algebra but it's not that complicated.
> >> Even something unoptimized should be fast.
> >>
> >> On Thu, Sep 1, 2016 at 1:37 PM, Aseem Bansal <asmbansal2@gmail.com>
> wrote:
> >> > Hi
> >> >
> >> > Currently I am trying to use NaiveBayes to make predictions, but I
> >> > am facing the issue that the predictions take on the order of a few
> >> > seconds. I tried the other model examples shipped with Spark, but
> >> > they also ran in a minimum of 500 ms when I used the Scala API.
> >> >
> >> > Has anyone used Spark ML to do predictions for a single row under
> >> > 20 ms?
> >> >
> >> > I am not doing premature optimization. The use case is that we are
> >> > doing real-time predictions and we need results in 20 ms, 30 ms at
> >> > maximum. This is a hard limit for our use case.
> >
> >
>
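[Editor's note: Sean's point that local scoring is just a small amount of linear algebra can be sanity-checked with a rough, self-contained timing sketch. The model values below are made up stand-ins for parameters extracted from a trained model, and this is plain Python, not Spark's optimized code path.]

```python
import time

# Made-up 2-class, 3-feature naive Bayes parameters, standing in for
# values extracted from a trained model.
log_prior = [-0.69, -0.69]
log_theta = [[-0.5, -1.5, -2.0], [-2.0, -1.5, -0.5]]

def predict(features):
    # One small dot product per class, then argmax.
    scores = [lp + sum(w * x for w, x in zip(row, features))
              for lp, row in zip(log_prior, log_theta)]
    return scores.index(max(scores))

# Time 10,000 single-row predictions.
start = time.perf_counter()
for _ in range(10_000):
    predict([3.0, 1.0, 0.0])
elapsed_ms = (time.perf_counter() - start) * 1000.0
per_row_ms = elapsed_ms / 10_000
# Even unoptimized pure Python lands at microseconds per row, so a
# 20 ms budget is dominated by whatever wraps the call (job
# scheduling, serialization), not by the model math itself.
```

This suggests the several-hundred-millisecond figures reported above are per-job overhead from the distributed code path, not the cost of the scoring arithmetic.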
