spark-dev mailing list archives

From Dongjin Lee <>
Subject Re: Spark Local Pipelines
Date Mon, 13 Mar 2017 15:08:29 GMT
Although I love Asher's idea, I'd rather +1 Sean's view; I think it would
be much better for this to live outside of the project.


On Mon, Mar 13, 2017 at 5:39 PM, Sean Owen <> wrote:

> I'm skeptical.  Serving synchronous queries from a model at scale is a
> fundamentally different activity. As you note, it doesn't logically involve
> Spark. If it has to happen in milliseconds it's going to be in-core.
> Scoring even 10qps with a Spark job per request is probably a non-starter;
> think of the thousands of tasks per second and the overhead of just
> tracking them.
> When you say the RDDs support point prediction, I think you mean that
> those older models expose a method to score a Vector. They are not somehow
> exposing distributed point prediction. You could add this to the newer
> models, but it raises the question of how to make the Row to feed it. The
> .mllib API punts on this and assumes you can construct the Vector.
> I think this sweeps a lot under the rug in assuming that there can just be
> a "local" version of every Transformer -- but, even if there could be,
> consider how much extra implementation that is. Lots of them probably could
> be but I'm not sure that all can.
> The bigger problem in my experience is that Pipelines don't generally
> encapsulate the entire pipeline from source data to score. They encapsulate
> the part after computing underlying features. That is, if one of your
> features is "total clicks from this user", that's the product of a
> DataFrame operation that precedes a Pipeline. This can't be turned into a
> non-distributed, non-Spark local version.
> Solving subsets of this problem could still be useful, and you've
> highlighted some external projects that try. I'd also highlight PMML as an
> established interchange format for just the model part, and for cases that
> don't involve much or any pipeline, it's a better fit paired with a library
> that can score from PMML.
> I think this is one of those things that could live outside the project,
> because it's more not-Spark than Spark. Remember too that building a
> solution into the project blesses one at the expense of others.
> On Sun, Mar 12, 2017 at 10:15 PM Asher Krim <> wrote:
>> Hi All,
>> I spent a lot of time at Spark Summit East this year talking with Spark
>> developers and committers about challenges with productizing Spark. One of
>> the biggest shortcomings I've encountered in Spark ML pipelines is the lack
>> of a way to serve single requests with any reasonable performance.
>> SPARK-10413 explores adding methods for single item prediction, but I'd
>> like to explore a more holistic approach - a separate local api, with
>> models that support transformations without depending on Spark at all.
>> I've written up a doc
>> detailing the approach, and I'm happy to discuss alternatives. If this
>> gains traction, I can create a branch with a minimal example on a simple
>> transformer (probably something like CountVectorizerModel) so we have
>> something concrete to continue the discussion on.
>> Thanks,
>> Asher Krim
>> Senior Software Engineer
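
To make the "local" transformer idea in the quoted proposal concrete, here is a minimal sketch of what a Spark-free count vectorizer could look like. It is not taken from Asher's doc or from Spark itself: the class name `LocalCountVectorizer` and its `transform` method are hypothetical, and it assumes the vocabulary has already been extracted from a fitted model by some other means.

```python
from collections import Counter
from typing import Dict, List, Sequence


class LocalCountVectorizer:
    """Hypothetical Spark-free stand-in for a fitted count vectorizer."""

    def __init__(self, vocabulary: Sequence[str]) -> None:
        # Map each vocabulary term to its column index in the output vector.
        self.index: Dict[str, int] = {term: i for i, term in enumerate(vocabulary)}

    def transform(self, tokens: Sequence[str]) -> List[float]:
        """Return a dense term-count vector for one tokenized document."""
        # Count only tokens that appear in the vocabulary; others are ignored.
        counts = Counter(t for t in tokens if t in self.index)
        vector = [0.0] * len(self.index)
        for term, n in counts.items():
            vector[self.index[term]] = float(n)
        return vector


# Assumed vocabulary, for illustration only.
model = LocalCountVectorizer(["spark", "local", "pipeline"])
print(model.transform(["spark", "local", "spark", "unknown"]))  # [2.0, 1.0, 0.0]
```

Scoring a single request this way is an in-process function call with no job or task scheduling. It covers only the transformer itself; features computed by DataFrame operations before the Pipeline, as Sean points out, are not addressed by a sketch like this.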

*Dongjin Lee*

*Software developer in Line+.So interested in massive-scale machine
