metron-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrew Psaltis <psaltis.and...@gmail.com>
Subject Re: Metron-265 Model as a Service
Date Fri, 08 Jul 2016 16:48:41 GMT
Totally agree and I surely don't want to get in the way either. Just trying
to think of how to keep everything having to do with model execution inside
of the model service and not end up with an architecture in which the model
service has API leakage across layers.

On Fri, Jul 8, 2016 at 11:01 AM, Casey Stella <cestella@gmail.com> wrote:

> Feature selection is so very model specific, that I don't know of a good
> way to generalize that.  Really, I try very hard not to get in the way of
> data scientists doing data science and feature selection sits squarely in
> that domain.  For the DGA model, it's fairly simple to make the cache
> key..the domain.  There should be a strong working set for that model, so
> it should be fairly effective.
>
> On Fri, Jul 8, 2016 at 10:05 AM, Andrew Psaltis <psaltis.andrew@gmail.com>
> wrote:
>
> > Very interesting. Stellar does make life easier/more transparent for JVM
> > developers, but it would seem that there will end-up being a client that
> > has to be developed for non-jvm languages otherwise folks will be left to
> > do all of the work themselves.
> >
> > Based on some of the reasons cited above (GPU, etc.) it seems like a
> > requirement to have a separate model service. It would be really nice to
> > find a way to not let the innards of that API leak into everywhere else.
> If
> > all feature selection happens in the model service, that then raises some
> > of the same questions SImon did as far as how can the Storm bolts
> generate
> > a good cache key so that the model service does not need to be called.
> > Perhaps, this could be solved with a model service library and not just
> > exposing a REST (or other protocol) endpoint, the feature selection and
> > cache key generation can be hidden from clients. This still poses the
> issue
> > of having to carry this abstraction across languages.
> >
> > On Fri, Jul 8, 2016 at 9:33 AM, Casey Stella <cestella@gmail.com> wrote:
> >
> > > So, it's an interesting discussion about where exactly feature
> selection
> > > happens.  I suspect it will happen in multiple places.  Let's take the
> > DGA
> > > model as our motivating example.  This guy is likely to not require too
> > > much beyond the domain name.  The model code should pull apart the
> > features
> > > from that domain (entropy, the tld, stripping subdomains, etc.) and
> > > probably has some reference data resident within the model (probability
> > of
> > > bigrams in various languages, for instance) that it will use to build
> the
> > > real input.  As it stands, a lot of the feature selection is likely to
> be
> > > done in the model, but the model should be explicit about what it wants
> > as
> > > input.  For instance, it could demand only the raw domain or it could
> > > demand the subdomains, domain and tld to be separated out.  In either
> > case,
> > > the caching should work as long as the model is deterministic for a
> given
> > > input.
> > >
> > > I think it will be interesting to see, however, where the feature
> > selection
> > > will happen for most models.  This is essentially why I pushed for this
> > to
> > > be part of stellar, so that some transformation can be done prior to
> > model
> > > execution.  For instance,  for our DGA model, you could call
> > > MODEL_APPLY('dga',  DOMAIN_REMOVE_TLD(domain), DOMAIN_TO_TLD(domain) )
> > > which would do the (admittedly not-so) heavy work of separating tlds
> from
> > > the domain inside of stellar as opposed to having to do it in the
> > language
> > > that is being used to implement your model.
> > >
> > >
> > > On Thu, Jul 7, 2016 at 6:33 PM, Simon Ball <sball@hortonworks.com>
> > wrote:
> > >
> > > > There is an interesting division of concerns here that might impact
> the
> > > > design.
> > > >
> > > > If we're looking to cache things like dga which operate on a subset
> of
> > > the
> > > > enriched Metron data model, then we essentially need to push the
> > feature
> > > > selection, or at least feature projection elements of the model to
> the
> > > edge
> > > > (the bolt) to produce a cache key. This seems to make sense in the
> > > context
> > > > of the function call to the model proposed, but means that the model
> > call
> > > > does not apply to a whole Metron data record, but a subset determined
> > by
> > > > that call on the dsl. This implicitly pushes model related concerns
> > > > (feature selection) outside of the canonical scope for defining the
> > > models
> > > > themselves (the model service), which loses model encapsulation.
> > > >
> > > > In essence you would be embedding the feature selection (projection)
> of
> > > > the model engine in the storm bolts in order to make caching
> possible,
> > > > which would need some sort of central control, and rationalisation to
> > > avoid
> > > > cache misses between multiple models with slightly different feature
> > > sets.
> > > > This could add complexity, or reduce cache utilisation really quickly
> > > with
> > > > model scale.
> > > >
> > > > Simon
> > > >
> > > >
> > > > > On 7 Jul 2016, at 18:51, Casey Stella <cestella@gmail.com> wrote:
> > > > >
> > > > > Great questions Andrew.  Thanks for the interest. :)
> > > > >
> > > > > RE:: "which is why there would be a caching layer set in front of
> it
> > at
> > > > the
> > > > > Storm bolt level"
> > > > >
> > > > > Right now we have a LRU caching layer in front of the HBase
> > enrichment
> > > > > adapters, so it would work similarly.  You can imagine, the range
> of
> > > > inputs
> > > > > is likely not perfectly random, so it's reasonable for the cache to
> > > have
> > > > a
> > > > > non-empty working set.  Take for instance a DGA model; the input
> > would
> > > > be a
> > > > > domain and most organizations will have an uneven distribution of
> > > domains
> > > > > they access with a heavy skew toward a small number.
> > > > >
> > > > > RE: In this scenario, you can at least scale out via load balancing
> > > (i.e.
> > > > > multiple model services serving the same model) since the models
> are
> > > > > immutable.
> > > > >
> > > > > I am talking about model execution here.  The endpoints are
> > distributed
> > > > > across the cluster and the storm bolt chooses a service to use
> (with
> > a
> > > > bias
> > > > > toward using one that is local to that bolt) and the request is
> made
> > to
> > > > the
> > > > > endpoint, which scores the input and returns the response.
> > > > >
> > > > > Model service, if that term means what I think it means, is almost
> > > > entirely
> > > > > done inside of zookeeper.  For clarity, I'm talking about service
> > > > discovery
> > > > > (bolt discovers which endpoints serve which models) and model
> > updates.
> > > > We
> > > > > are not sending the model around to any bolts or any such thing,
> just
> > > for
> > > > > clarity sake.
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Jul 7, 2016 at 9:47 AM, Andrew Psaltis <
> > > psaltis.andrew@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > >> Thanks Casey! Couple of quick questions.
> > > > >>
> > > > >> RE:: "which is why there would be a caching layer set in front of
> it
> > > at
> > > > the
> > > > >> Storm bolt level"
> > > > >> Hmm, would this be of the results of model execution? Would this
> > > really
> > > > >> work when each tuple may contain totally different data? Or is the
> > > > caching
> > > > >> going to be smart enough that it will look at all the data passed
> in
> > > and
> > > > >> determine that an identical tuple has already been evaluated so
> > serve
> > > > the
> > > > >> result out of cache?
> > > > >>
> > > > >> RE: "Also, we would prefer local instances of the service when and
> > > where
> > > > >> possible"
> > > > >> Perfect makes sense.
> > > > >>
> > > > >> RE: Serving many models from every storm bolt is also fairly
> > > expensive.
> > > > >> I can see how it could be, but couldn't  we can make sure that not
> > all
> > > > >> models live in every bolt?
> > > > >>
> > > > >> RE: In this scenario, you can at least scale out via load
> balancing
> > > > (i.e.
> > > > >> multiple model services serving the same model) since the models
> are
> > > > >> immutable.
> > > > >> This seems to address the model serving, not model execution
> > service.
> > > > >> Having yet one more layer to scale and mange also seems like it
> > > > >> would further complicate things. Could we not just also scale the
> > > bolts?
> > > > >>
> > > > >> Thanks,
> > > > >> Andrew
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>> On Thu, Jul 7, 2016 at 12:37 PM, Casey Stella <
> cestella@gmail.com>
> > > > wrote:
> > > > >>>
> > > > >>> So, regarding the expense of communication; I tend to agree that
> it
> > > is
> > > > >>> expensive, which is why there would be a caching layer set in
> front
> > > of
> > > > it
> > > > >>> at the Storm bolt level.  Also, we would prefer local instances
> of
> > > the
> > > > >>> service when and where possible.  Serving many models from every
> > > storm
> > > > >> bolt
> > > > >>> is also fairly expensive.  In this scenario, you can at least
> scale
> > > out
> > > > >> via
> > > > >>> load balancing (i.e. multiple model services serving the same
> > model)
> > > > >> since
> > > > >>> the models are immutable.
> > > > >>>
> > > > >>> On Thu, Jul 7, 2016 at 9:24 AM, Andrew Psaltis <
> > > > psaltis.andrew@gmail.com
> > > > >>>
> > > > >>> wrote:
> > > > >>>
> > > > >>>> OK that makes sense. So the doc attached to this JIRA[1] just
> > speaks
> > > > to
> > > > >>> the
> > > > >>>> Model serving. Is there a doc for the model service? And by
> making
> > > > >> this a
> > > > >>>> separate service we are saying that for every
> > > > “MODEL_APPLY(model_name,
> > > > >>>> param_1, param_2, …, param_n)” we are potentially going to go
> > across
> > > > >> the
> > > > >>>> wire and have a model executed? That seems pretty expensive, no?
> > > > >>>>
> > > > >>>> Thanks,
> > > > >>>> Andrew
> > > > >>>>
> > > > >>>> [1] https://issues.apache.org/jira/browse/METRON-265
> > > > >>>>
> > > > >>>>> On Thu, Jul 7, 2016 at 12:20 PM, Casey Stella <
> > cestella@gmail.com>
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>>> The "REST" model service, which I place in quotes because there
> > is
> > > > >> some
> > > > >>>>> strong discussion about whether REST is a reasonable transport
> > for
> > > > >>> this,
> > > > >>>> is
> > > > >>>>> responsible for providing the model.  The scoring/model
> > application
> > > > >>>> happens
> > > > >>>>> in the model service and the results get transferred back to
> the
> > > > >> storm
> > > > >>>> bolt
> > > > >>>>> that calls it.
> > > > >>>>>
> > > > >>>>> Casey
> > > > >>>>>
> > > > >>>>> On Thu, Jul 7, 2016 at 9:17 AM, Andrew Psaltis <
> > > > >>> psaltis.andrew@gmail.com
> > > > >>>>>
> > > > >>>>> wrote:
> > > > >>>>>
> > > > >>>>>> Trying to make sure I grok this thread and the word doc
> attached
> > > to
> > > > >>> the
> > > > >>>>>> JIRA. The word doc and JIRA speak to a Model Service Service
> and
> > > > >> that
> > > > >>>> the
> > > > >>>>>> REST service will be responsible for serving up models.
> However,
> > > > >> part
> > > > >>>> of
> > > > >>>>>> this conversation seems to suggest that the model execution
> will
> > > > >>>> actually
> > > > >>>>>> occur at the REST service .. in particular this comment from
> > > James:
> > > > >>>>>>
> > > > >>>>>> "There are several reasons to decouple model execution from
> > > Storm:"
> > > > >>>>>>
> > > > >>>>>> If the model execution is decoupled from Storm then it appears
> > > that
> > > > >>> the
> > > > >>>>>> REST service will be executing the model, not just serving it
> > up,
> > > > >> is
> > > > >>>> that
> > > > >>>>>> correct?
> > > > >>>>>>
> > > > >>>>>> Thanks,
> > > > >>>>>> Andrew
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>> On Thu, Jul 7, 2016 at 11:51 AM, Casey Stella <
> > > cestella@gmail.com>
> > > > >>>>>> wrote:
> > > > >>>>>>
> > > > >>>>>>> Regarding the performance of REST:
> > > > >>>>>>>
> > > > >>>>>>> Yep, so everyone seems to be worried about the performance
> > > > >>>> implications
> > > > >>>>>> for
> > > > >>>>>>> REST.  I made this comment on the JIRA, but I'll repeat it
> here
> > > > >> for
> > > > >>>>>> broader
> > > > >>>>>>> discussion:
> > > > >>>>>>>
> > > > >>>>>>> My choice of REST was mostly due to the fact that I want to
> > > > >> support
> > > > >>>>>>>> multi-language (I think that's a very important requirement)
> > > > >> and
> > > > >>>>> there
> > > > >>>>>>> are
> > > > >>>>>>>> REST libraries for pretty much everything. I do agree,
> > however,
> > > > >>>> that
> > > > >>>>>> JSON
> > > > >>>>>>>> transport can get chunky. How about a compromise and use
> REST,
> > > > >>> but
> > > > >>>>> the
> > > > >>>>>>>> input and output payloads for scoring are Maps encoded in
> > > > >> msgpack
> > > > >>>>>> rather
> > > > >>>>>>>> than JSON. There is a msgpack library for pretty much every
> > > > >>>> language
> > > > >>>>>> out
> > > > >>>>>>>> there (almost) and certainly all of the ones we'd like to
> > > > >> target.
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>> The other option is to just create and expose protobuf
> > bindings
> > > > >>>>> (thrift
> > > > >>>>>>>> doesn't have a native client for R) for all of the languages
> > > > >> that
> > > > >>>> we
> > > > >>>>>> want
> > > > >>>>>>>> to support. I'm perfectly fine with that, but I had some
> > > > >> worries
> > > > >>>>> about
> > > > >>>>>>> the
> > > > >>>>>>>> maturity of the bindings.
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>> The final option, as you suggest, is to just use raw
> sockets.
> > I
> > > > >>>> think
> > > > >>>>>> if
> > > > >>>>>>>> we went that route, we might have to create a layer for each
> > > > >>>> language
> > > > >>>>>>>> rather than relying on model creators to create a TCP
> server.
> > I
> > > > >>>>> thought
> > > > >>>>>>>> that might be a bit onerous for a MVP.
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>> Given the discussion, though, what it has made me aware of
> is
> > > > >>> that
> > > > >>>> we
> > > > >>>>>>>> might not want to dictate a transport mechanism at all, but
> > > > >>> rather
> > > > >>>>>> allow
> > > > >>>>>>>> that to be pluggable and extensible (so each model would be
> > > > >>>>> associated
> > > > >>>>>>> with
> > > > >>>>>>>> a transport mechanism handler that would know how to
> > > > >> communicate
> > > > >>> to
> > > > >>>>> it.
> > > > >>>>>>> We
> > > > >>>>>>>> would provide default mechanisms for msgpack over REST, JSON
> > > > >> over
> > > > >>>>> REST
> > > > >>>>>>> and
> > > > >>>>>>>> maybe msgpack over raw TCP.) Thoughts?
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> Regarding PMML:
> > > > >>>>>>>
> > > > >>>>>>> I tend to agree with James that PMML is too restrictive as to
> > > > >>> models
> > > > >>>> it
> > > > >>>>>> can
> > > > >>>>>>> represent and I have not had great experiences with it in
> > > > >>> production.
> > > > >>>>>>> Also, the open source libraries for PMML have licensing
> issues
> > > > >>> (jpmml
> > > > >>>>>>> requires an older version to accommodate our licensing
> > > > >>> requirements).
> > > > >>>>>>>
> > > > >>>>>>> Regarding workflow:
> > > > >>>>>>>
> > > > >>>>>>> At the moment, I'd like to focus on getting a generalized
> > > > >>>>> infrastructure
> > > > >>>>>>> for model scoring and updating put in place.   This means,
> this
> > > > >>>>>>> architecture takes up the baton from the point when a model
> is
> > > > >>>>>>> trained/created.  Also, I have attempted to be generic in
> terms
> > > > >> of
> > > > >>>>> output
> > > > >>>>>>> of the model (a map of results) so it can fit any type of
> model
> > > > >>> that
> > > > >>>> I
> > > > >>>>>> can
> > > > >>>>>>> think of.  If that's not the case, let me know, though.
> > > > >>>>>>>
> > > > >>>>>>> For instance, for clustering, you would probably emit the
> > cluster
> > > > >>> id
> > > > >>>>>>> associated with the input and that would be added to the
> > message
> > > > >> as
> > > > >>>> it
> > > > >>>>>>> passes through the storm topology.  The model is responsible
> > for
> > > > >>>>>> processing
> > > > >>>>>>> the input and constructing properly formed output.
> > > > >>>>>>>
> > > > >>>>>>> Casey
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> On Tue, Jul 5, 2016 at 3:45 PM, Debo Dutta (dedutta) <
> > > > >>>>> dedutta@cisco.com>
> > > > >>>>>>> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Following up on the thread a little late …. Awesome start
> > > > >> Casey.
> > > > >>>> Some
> > > > >>>>>>>> comments:
> > > > >>>>>>>> * Model execution
> > > > >>>>>>>> ** I am guessing the model execution will be on YARN only
> for
> > > > >>> now.
> > > > >>>>> This
> > > > >>>>>>> is
> > > > >>>>>>>> fine, but the REST call could have an overhead - depends on
> > the
> > > > >>>>> speed.
> > > > >>>>>>>> * PMML: won’t we have to choose some DSL for describing
> > models?
> > > > >>>>>>>> * Model:
> > > > >>>>>>>> ** workflow vs a model -  do we care about the “workflow"
> that
> > > > >>>> leads
> > > > >>>>> to
> > > > >>>>>>>> the models or just the “model"? For example, we might start
> > > > >> with
> > > > >>> n
> > > > >>>>>>> features
> > > > >>>>>>>> —> do feature selection to choose k (or apply a transform
> > > > >>> function)
> > > > >>>>> —>
> > > > >>>>>>>> apply a model etc
> > > > >>>>>>>> * Use cases - I can see this working for n-ary
> classification
> > > > >>> style
> > > > >>>>>>> models
> > > > >>>>>>>> easily. Will the same mechanism be used for stuff like
> > > > >> clustering
> > > > >>>> (or
> > > > >>>>>>>> intermediate steps like feature selection alone).
> > > > >>>>>>>>
> > > > >>>>>>>> Thx
> > > > >>>>>>>> debo
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>> On 7/5/16, 3:24 PM, "James Sirota" <jsirota@apache.org>
> > wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>> Simon,
> > > > >>>>>>>>>
> > > > >>>>>>>>> There are several reasons to decouple model execution from
> > > > >>> Storm:
> > > > >>>>>>>>>
> > > > >>>>>>>>> - Reliability: It's much easier to handle a failed service
> > > > >> than
> > > > >>> a
> > > > >>>>>> failed
> > > > >>>>>>>> bolt.  You can also troubleshoot without having to bring
> down
> > > > >> the
> > > > >>>>>>> topology
> > > > >>>>>>>>> - Complexity: you de-couple the model logic from Storm
> logic
> > > > >> and
> > > > >>>> can
> > > > >>>>>>>> manage it independently of Storm
> > > > >>>>>>>>> - Portability: you can swap the model guts (switch from
> Spark
> > > > >> to
> > > > >>>>>> Flink,
> > > > >>>>>>>> etc) and as long as you maintain the interface you are good
> to
> > > > >> go
> > > > >>>>>>>>> - Consistency: since we want to expose our models the same
> > way
> > > > >>> we
> > > > >>>>>> expose
> > > > >>>>>>>> threat intel then it makes sense to expose them as a service
> > > > >>>>>>>>>
> > > > >>>>>>>>> In our vision for Metron we want to make it easy to uptake
> > and
> > > > >>>> share
> > > > >>>>>>>> models.  I think well-defined interfaces and programmatic
> ways
> > > > >> of
> > > > >>>>>>>> deployment, lifecycle management, and scoring via
> well-defined
> > > > >>> REST
> > > > >>>>>>>> interfaces will make this task easier.  We can do a few
> things
> > > > >> to
> > > > >>>>>>>>>
> > > > >>>>>>>>> With respect to PMML I personally had not had much luck
> with
> > > > >> it
> > > > >>> in
> > > > >>>>>>>> production.  I would prefer models as POJOs.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Thanks,
> > > > >>>>>>>>> James
> > > > >>>>>>>>>
> > > > >>>>>>>>> 04.07.2016, 16:07, "Simon Ball" <sball@hortonworks.com>:
> > > > >>>>>>>>>> Since the models' parameters and execution algorithm are
> > > > >>> likely
> > > > >>>> to
> > > > >>>>>> be
> > > > >>>>>>>> small, why not have the model store push the model changes
> and
> > > > >>>>> scoring
> > > > >>>>>>>> direct to the bolts and execute within storm. This negates
> the
> > > > >>>>> overhead
> > > > >>>>>>> of
> > > > >>>>>>>> a rest call to the model server, and the need for discovery
> of
> > > > >>> the
> > > > >>>>>> model
> > > > >>>>>>>> server in zookeeper.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Something like the way ranger policies are updated /
> cached
> > > > >> in
> > > > >>>>>> plugins
> > > > >>>>>>>> would seem to make sense, so that we're distributing the
> model
> > > > >>>>>> execution
> > > > >>>>>>>> directly into the enrichment pipeline rather than collecting
> > > > >> in a
> > > > >>>>>> central
> > > > >>>>>>>> service.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> This would work with simple models on single events, but
> may
> > > > >>>>>> struggle
> > > > >>>>>>>> with correlation based models. However, those could be
> handled
> > > > >> in
> > > > >>>>> storm
> > > > >>>>>>> by
> > > > >>>>>>>> pushing into a windowing trident topology or something of
> the
> > > > >>> sort,
> > > > >>>>> or
> > > > >>>>>>> even
> > > > >>>>>>>> with a parallel spark streaming job using the same method of
> > > > >>>>>> distributing
> > > > >>>>>>>> models.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> The real challenge here would be stateful online models,
> > > > >> which
> > > > >>>>> seem
> > > > >>>>>>>> like a minority case which could be handled by a shared
> state
> > > > >>> store
> > > > >>>>>> such
> > > > >>>>>>> as
> > > > >>>>>>>> HBase.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> You still keep the ability to run different languages, and
> > > > >>>>>> platforms,
> > > > >>>>>>>> but wrap managing the parallelism in storm bolts rather than
> > > > >> yarn
> > > > >>>>>>>> containers.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> We could also consider basing the model protocol on a a
> > > > >> common
> > > > >>>>> model
> > > > >>>>>>>> language like pmml, thong that is likely to be highly
> > limiting.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Simon
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> On 4 Jul 2016, at 22:35, Casey Stella <
> cestella@gmail.com
> > > > >>>
> > > > >>>>> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> This is great! I'll capture any requirements that anyone
> > > > >>> wants
> > > > >>>>> to
> > > > >>>>>>>>>>> contribute and ensure that the proposed architecture
> > > > >>>>> accommodates
> > > > >>>>>>>> them. I
> > > > >>>>>>>>>>> think we should focus on a minimal set of requirements
> and
> > > > >>> an
> > > > >>>>>>>> architecture
> > > > >>>>>>>>>>> that does not preclude a larger set. I have found that
> the
> > > > >>>> best
> > > > >>>>>>>> driver of
> > > > >>>>>>>>>>> requirements are installed users. :)
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> For instance, I think a lot of questions about how often
> > > > >> to
> > > > >>>>>> update a
> > > > >>>>>>>> model
> > > > >>>>>>>>>>> and such should be represented in the architecture by the
> > > > >>>>> ability
> > > > >>>>>> to
> > > > >>>>>>>>>>> manually update a model, so as long as we have the
> ability
> > > > >>> to
> > > > >>>>>>> update,
> > > > >>>>>>>>>>> people can choose when and where to do it (i.e. time
> based
> > > > >>> or
> > > > >>>>> some
> > > > >>>>>>>> other
> > > > >>>>>>>>>>> trigger). That being said, we don't want to cause too
> much
> > > > >>>>> effort
> > > > >>>>>>> for
> > > > >>>>>>>> the
> > > > >>>>>>>>>>> user if we can avoid it with features.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> In terms of the questions laid out, here are the
> > > > >> constraints
> > > > >>>>> from
> > > > >>>>>>> the
> > > > >>>>>>>>>>> proposed architecture as I see them. It'd be great to get
> > > > >> a
> > > > >>>>> sense
> > > > >>>>>> of
> > > > >>>>>>>>>>> whether these constraints are too onerous or where
> they're
> > > > >>> not
> > > > >>>>>>>> opinionated
> > > > >>>>>>>>>>> enough :
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>  - Model versioning and retention
> > > > >>>>>>>>>>>  - We do have the ability to update models, but the
> > > > >>> training
> > > > >>>>> and
> > > > >>>>>>>> decision
> > > > >>>>>>>>>>>     of when to update the model is left up to the user.
> > > > >> We
> > > > >>>> may
> > > > >>>>>> want
> > > > >>>>>>>> to think
> > > > >>>>>>>>>>>     deeply about when and where automated model updates
> > > > >> can
> > > > >>>> fit
> > > > >>>>>>>>>>>     - Also, retention is currently manual. It might be an
> > > > >>>>> easier
> > > > >>>>>>> win
> > > > >>>>>>>> to
> > > > >>>>>>>>>>>     set up policies around when to sunset models (after
> > > > >>> newer
> > > > >>>>>>>> versions are
> > > > >>>>>>>>>>>     added, for instance).
> > > > >>>>>>>>>>>  - Model access controls management
> > > > >>>>>>>>>>>  - The architecture proposes no constraints around this.
> > > > >> As
> > > > >>>> it
> > > > >>>>>>> stands
> > > > >>>>>>>>>>>     now, models are held in HDFS, so it would inherit the
> > > > >>>> same
> > > > >>>>>>>> security
> > > > >>>>>>>>>>>     capabilities from that (user/group permissions +
> > > > >>> Ranger,
> > > > >>>>> etc)
> > > > >>>>>>>>>>>  - Requirements around concept drift
> > > > >>>>>>>>>>>  - I'd love to hear user requirements around how we could
> > > > >>>>>>>> automatically
> > > > >>>>>>>>>>>     address concept drift. The architecture as it's
> > > > >>> proposed
> > > > >>>>>> let's
> > > > >>>>>>>> the user
> > > > >>>>>>>>>>>     decide when to update models.
> > > > >>>>>>>>>>>  - Requirements around model output
> > > > >>>>>>>>>>>  - The architecture as it stands just mandates a JSON map
> > > > >>>> input
> > > > >>>>>> and
> > > > >>>>>>>> JSON
> > > > >>>>>>>>>>>     map output, so it's up to the model what they want to
> > > > >>>> pass
> > > > >>>>>>> back.
> > > > >>>>>>>>>>>     - It's also up to the model to document its own
> > > > >> output.
> > > > >>>>>>>>>>>  - Any model audit and logging requirements
> > > > >>>>>>>>>>>  - The architecture proposes no constraints around this.
> > > > >>> I'd
> > > > >>>>> love
> > > > >>>>>>> to
> > > > >>>>>>>> see
> > > > >>>>>>>>>>>     community guidance around this. As it stands, we just
> > > > >>> log
> > > > >>>>>> using
> > > > >>>>>>>> the same
> > > > >>>>>>>>>>>     mechanism as any YARN application.
> > > > >>>>>>>>>>>  - What model metrics need to be exposed
> > > > >>>>>>>>>>>  - The architecture proposes no constraints around this.
> > > > >>> I'd
> > > > >>>>> love
> > > > >>>>>>> to
> > > > >>>>>>>> see
> > > > >>>>>>>>>>>     community guidance around this.
> > > > >>>>>>>>>>>     - Requirements around failure modes
> > > > >>>>>>>>>>>  - We briefly touch on this in the document, but it is
> > > > >>>> probably
> > > > >>>>>> not
> > > > >>>>>>>>>>>     complete. Service endpoint failure will result in
> > > > >>>>>> blacklisting
> > > > >>>>>>>> from a
> > > > >>>>>>>>>>>     storm bolt perspective and node failure should result
> > > > >>> in
> > > > >>>> a
> > > > >>>>>> new
> > > > >>>>>>>> container
> > > > >>>>>>>>>>>     being started by the Yarn application master. Beyond
> > > > >>>> that,
> > > > >>>>>> the
> > > > >>>>>>>>>>>     architecture isn't explicit.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> On Mon, Jul 4, 2016 at 1:49 PM, James Sirota <
> > > > >>>>> jsirota@apache.org
> > > > >>>>>>>
> > > > >>>>>>>> wrote:
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> I left a comment on the JIRA. I think your design is
> > > > >>>> promising.
> > > > >>>>>> One
> > > > >>>>>>>>>>>> other thing I would suggest is for us to crowd source
> > > > >>>>>> requirements
> > > > >>>>>>>> around
> > > > >>>>>>>>>>>> model management. Specifically:
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Model versioning and retention
> > > > >>>>>>>>>>>> Model access controls management
> > > > >>>>>>>>>>>> Requirements around concept drift
> > > > >>>>>>>>>>>> Requirements around model output
> > > > >>>>>>>>>>>> Any model audit and logging requirements
> > > > >>>>>>>>>>>> What model metrics need to be exposed
> > > > >>>>>>>>>>>> Requirements around failure modes
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> 03.07.2016, 14:00, "Casey Stella" <cestella@gmail.com>:
> > > > >>>>>>>>>>>>> Hi all,
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> I think we are at the point where we should try to
> > > > >> tackle
> > > > >>>>> Model
> > > > >>>>>>> as a
> > > > >>>>>>>>>>>>> service for Metron. As such, I created a JIRA and
> > > > >> proposed
> > > > >>>> an
> > > > >>>>>>>>>>>> architecture
> > > > >>>>>>>>>>>>> for accomplishing this within Metron.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> My inclination is to be data science language/library
> > > > >>>> agnostic
> > > > >>>>>> and
> > > > >>>>>>>> to
> > > > >>>>>>>>>>>>> provide a general purpose REST infrastructure for
> > > > >> managing
> > > > >>>> and
> > > > >>>>>>>> serving
> > > > >>>>>>>>>>>>> models trained on historical data captured from Metron.
> > > > >>> The
> > > > >>>>>>>> assumption is
> > > > >>>>>>>>>>>>> that we are within the hadoop ecosystem, so:
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>  - Models stored on HDFS
> > > > >>>>>>>>>>>>>  - REST Model Services resource-managed via Yarn
> > > > >>>>>>>>>>>>>  - REST Model Services discovered via Zookeeper.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> I would really appreciate community comment on the JIRA
> > > > >> (
> > > > >>>>>>>>>>>>> https://issues.apache.org/jira/browse/METRON-265). The
> > > > >>>>> proposed
> > > > >>>>>>>>>>>>> architecture is attached as a document to that JIRA.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> I look forward to feedback!
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Best,
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Casey
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> -------------------
> > > > >>>>>>>>>>>> Thank you,
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> James Sirota
> > > > >>>>>>>>>>>> PPMC- Apache Metron (Incubating)
> > > > >>>>>>>>>>>> jsirota AT apache DOT org
> > > > >>>>>>>>>
> > > > >>>>>>>>> -------------------
> > > > >>>>>>>>> Thank you,
> > > > >>>>>>>>>
> > > > >>>>>>>>> James Sirota
> > > > >>>>>>>>> PPMC- Apache Metron (Incubating)
> > > > >>>>>>>>> jsirota AT apache DOT org
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> --
> > > > >>>>>> Thanks,
> > > > >>>>>> Andrew
> > > > >>>>>>
> > > > >>>>>> Subscribe to my book: Streaming Data <
> > http://manning.com/psaltis>
> > > > >>>>>> <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
> > > > >>>>>> twiiter: @itmdata <
> > > > >>> http://twitter.com/intent/user?screen_name=itmdata>
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> --
> > > > >>>> Thanks,
> > > > >>>> Andrew
> > > > >>>>
> > > > >>>> Subscribe to my book: Streaming Data <
> http://manning.com/psaltis>
> > > > >>>> <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
> > > > >>>> twiiter: @itmdata <
> > > http://twitter.com/intent/user?screen_name=itmdata
> > > > >
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Thanks,
> > > > >> Andrew
> > > > >>
> > > > >> Subscribe to my book: Streaming Data <http://manning.com/psaltis>
> > > > >> <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
> > > > >> twiiter: @itmdata <
> > http://twitter.com/intent/user?screen_name=itmdata
> > > >
> > > > >>
> > > >
> > >
> >
> >
> >
> > --
> > Thanks,
> > Andrew
> >
> > Subscribe to my book: Streaming Data <http://manning.com/psaltis>
> > <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
> > twiiter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>
> >
>



-- 
Thanks,
Andrew

Subscribe to my book: Streaming Data <http://manning.com/psaltis>
<https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
twiiter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message