metron-dev mailing list archives

From Casey Stella <ceste...@gmail.com>
Subject Re: [DISCUSS] Expansion of the capabilities of PROFILE_GET
Date Mon, 23 Jan 2017 20:53:30 GMT
Yeah, that 'holidays' bit is harder than I initially thought.

On Mon, Jan 23, 2017 at 3:42 PM, Michael Miklavcic <
michael.miklavcic@gmail.com> wrote:

> Casey,
>
> I think this is a move in the right direction. I am partial to the DSL.
> While it is another DSL to learn, I believe it is far easier to understand
> and write lookback rules for than it would be using the map of values.
>
> I'd like to suggest that the concept of holiday should be pluggable by the
> user. I would also like to make a case for the includes and excludes
> prioritization being based on order in the DSL. So in the holiday example
> you might be able to say something like PROFILE_LOOKBACK( '1 hour bins from
> 1 hour to 1 month including tuesdays excluding holidays including
> newyears')
>
> Thanks,
> Mike
>
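One way to read the order-based prioritization suggested above is "the last
matching clause wins", with the notion of a holiday supplied by the user. The
following is only a minimal Python sketch of that reading, not Metron code:
is_selected, the rule tuples, and the holiday set are all hypothetical names
and data chosen for illustration.

```python
from datetime import date

def is_selected(day, rules):
    """Apply (action, predicate) rules in order; the last matching rule wins."""
    selected = False  # assumption: nothing is selected until a clause includes it
    for action, predicate in rules:
        if predicate(day):
            selected = (action == "include")
    return selected

# Hypothetical, user-supplied holiday calendar (the "pluggable" part).
HOLIDAYS = {date(2018, 12, 25), date(2019, 1, 1)}

def is_tuesday(day):
    return day.weekday() == 1  # date.weekday(): Monday == 0

def is_holiday(day):
    return day in HOLIDAYS

def is_new_years(day):
    return day.month == 1 and day.day == 1

# 'including tuesdays excluding holidays including newyears'
RULES = [
    ("include", is_tuesday),
    ("exclude", is_holiday),
    ("include", is_new_years),
]
```

Under this reading, Tuesday December 25th, 2018 is dropped as a holiday, while
Tuesday January 1st, 2019 is re-included by the later clause. Note that the
last clause would also include a New Year's Day that is not a Tuesday; whether
that is desirable is a design decision for the DSL.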
>
> On Mon, Jan 23, 2017 at 1:01 PM, Casey Stella <cestella@gmail.com> wrote:
>
> > Hi All,
> >
> > I'm planning to expand the capabilities of PROFILE_GET and wanted to run
> > an idea past the community.
> >
> > *Current State*
> >
> > Currently, the functionality of PROFILE_GET is fairly straightforward:
> >
> >    - profile - The name of the profile.
> >    - entity - The name of the entity.
> >    - durationAgo - How long ago should values be retrieved from?
> >    - units - The units of 'durationAgo'.
> >    - groups_list - Optional, must correspond to the 'groupBy' list used in
> >    profile creation - List (in square brackets) of groupBy values used to
> >    filter the profile. Default is the empty list, meaning groupBy was not
> >    used when creating the profile.
> >    - config_overrides - Optional - Map (in curly braces) of name:value
> >    pairs, each overriding the global config parameter of the same name.
> >    Default is the empty Map, meaning no overrides.
> >
> > This has the advantage of providing a relatively simple mechanism to
> > support the dominant use-case, gathering the profiles for a trailing
> > window.  There are a couple of issues, however:
> >
> >    - We may need more complex semantics for specifying the window
> >    (motivated below)
> >    - As such, this couples the gathering of the profiles with the
> >    specification of the window.
> >
> > I propose to decouple these two concepts. I propose that we extract the
> > notion of the lookback into a separate, more featureful function called
> > PROFILE_LOOKBACK() which could be composed with an adjusted PROFILE_GET,
> > whose arguments look like:
> >
> >
> >    - profile - The name of the profile.
> >    - entity - The name of the entity.
> >    - timestamps - The list of timestamps to retrieve
> >    - groups_list - Optional, must correspond to the 'groupBy' list used in
> >    profile creation - List (in square brackets) of groupBy values used to
> >    filter the profile. Default is the empty list, meaning groupBy was not
> >    used when creating the profile.
> >    - config_overrides - Optional - Map (in curly braces) of name:value
> >    pairs, each overriding the global config parameter of the same name.
> >    Default is the empty Map, meaning no overrides.
> >
> > So, PROFILE_GET would have the output of PROFILE_LOOKBACK passed to it as
> > its 3rd argument (e.g. PROFILE_GET( 'my_profile', 'my_entity',
> > PROFILE_LOOKBACK(...)) ).
> >
> > *Motivation for Change*
> >
> > The justification for this is that sometimes you want to compare time
> > bins for a long duration back, but you don't want to skew the data by
> > including periods that aren't distributionally similar (due to seasonal
> > data, for instance).  You might want to compare a value to a statistical
> > baseline, such as the median of the values for the same time window on
> > the same day of the week over the last month (e.g. every tuesday at this
> > time).
> >
> > Also, we might want a trailing window that does not start at the current
> > time (in wall-clock), but rather starts an hour back or from the time
> > that the data was originally ingested.
> >
> >
> > *PROFILE_LOOKBACK*
> >
> > I propose that we support the following features:
> >
> >    - A starting point that is not current time
> >    - Sparse bins (i.e. the last hour for every tuesday for the last
> >    month)
> >    - The ability to skip events (e.g. weekends, holidays)
> >
> >
> > This would result in a new function with the following arguments:
> >
> >    - from - The lookback starting point (defaults to now)
> >    - fromUnits - The units for the lookback starting point
> >    - to - The ending point for the lookback window (defaults to from +
> >    binSize)
> >    - toUnits - The units for the lookback ending point
> >    - including - A list of conditions which we would include.
> >       - weekend
> >       - holiday
> >       - sunday through saturday
> >    - excluding - A list of conditions which we would skip.
> >       - weekend
> >       - holiday
> >       - sunday through saturday
> >    - binSize - The size of the lookback bin
> >    - binUnits - The units of the lookback bin
> >
> > Given the number of arguments, their complexity, and the fact that many
> > of them are optional, I propose that either
> >
> >    - PROFILE_LOOKBACK take a Map so that we can get essentially named
> >    params in stellar.
> >    - PROFILE_LOOKBACK accept a string backed by a DSL to express these
> >    criteria
> >
> >
> > Ok, so that's a lot to take in.  How about we look at some motivating
> > use-cases.
> >
> > *Base Case: A lookback of 1 hour ago*
> >
> > As a map, this would look like:
> >
> > PROFILE_LOOKBACK( { 'binSize' : 1, 'binUnits' : 'HOURS' } )
> >
> > As a DSL this would look like:
> > PROFILE_LOOKBACK( '1 hour bins from now')
> >
> >
> > *The same time window every tuesday for the last month starting one hour
> > ago*
> >
> > Just to make this as clear as possible, if this is run at 3PM on Monday
> > January 23rd, 2017, it would include the following bins:
> >
> >    - January 17th, 2PM - 3PM
> >    - January 10th, 2PM - 3PM
> >    - January 3rd, 2PM - 3PM
> >    - December 27th, 2PM - 3PM
> >
> > As a map, this would look like:
> >
> > PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1,
> > 'toUnits' : 'MONTH', 'including' : [ 'tuesday' ], 'binSize' : 1,
> > 'binUnits' : 'HOURS' } )
> >
> > As a DSL this would look like:
> > PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including tuesdays')
> >
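To make the sparse-bin semantics of the Tuesday example concrete, here is a
minimal Python sketch that reproduces the four bins listed above. It is not
the proposed implementation: lookback_bins and its parameters are hypothetical,
"1 month" is approximated as 30 days, and each bin is assumed to start at
(now - from) on a matching day and to last binSize, which is what the listed
2PM - 3PM bins imply.

```python
from datetime import datetime, timedelta

TUESDAY = 1  # datetime.weekday(): Monday == 0

def lookback_bins(now, from_offset, to_offset, bin_size, weekdays):
    """Return (start, end) bins, newest first.

    Hypothetical helper: walks back one day at a time from (now - from_offset)
    until (now - to_offset), keeping only days whose weekday is in `weekdays`.
    """
    earliest = now - to_offset
    start = now - from_offset
    bins = []
    while start >= earliest:
        if start.weekday() in weekdays:
            bins.append((start, start + bin_size))
        start -= timedelta(days=1)
    return bins

# Run at 3PM on Monday January 23rd, 2017, as in the example above.
now = datetime(2017, 1, 23, 15, 0)
bins = lookback_bins(
    now,
    from_offset=timedelta(hours=1),
    to_offset=timedelta(days=30),  # assumption: "1 month" taken as 30 days
    bin_size=timedelta(hours=1),
    weekdays={TUESDAY},
)
for bin_start, bin_end in bins:
    print(bin_start.isoformat(), "-", bin_end.isoformat())
```

The resulting list could then be handed to the adjusted PROFILE_GET as its
timestamps argument.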
> > *The same time window every sunday for the last month starting one hour
> > ago skipping holidays*
> >
> > Just to make this as clear as possible, if this is run at 3PM on Monday
> > January 22nd, 2017, it would include the following bins:
> >
> >    - January 16th, 2PM - 3PM
> >    - January 9th, 2PM - 3PM
> >    - January 2nd, 2PM - 3PM
> >    - NOT December 25th
> >
> > As a map, this would look like:
> >
> > PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1,
> > 'toUnits' : 'MONTH', 'including' : [ 'tuesday' ],
> > 'excluding' : [ 'holidays' ], 'binSize' : 1, 'binUnits' : 'HOURS' } )
> >
> > As a DSL this would look like:
> > PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including tuesdays
> > excluding holidays')
> >
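The map and DSL forms above carry the same information, so a DSL string could
in principle be translated into the map form. The Python sketch below shows
one way that might look for exactly the phrasings used in these examples. The
grammar, parse_lookback, and the choice to upper-case and pluralize units are
assumptions made for illustration; nothing here is an existing Stellar or
Metron API.

```python
import re

# Hypothetical grammar covering only the example phrasings in this thread.
_PATTERN = re.compile(
    r"^(?P<bin>\d+) (?P<bin_units>\w+) bins from "
    r"(?:now|(?P<from>\d+) (?P<from_units>\w+))"
    r"(?: to (?P<to>\d+) (?P<to_units>\w+))?"
    r"(?P<clauses>(?: (?:including|excluding) \w+)*)$"
)

def _units(word):
    # Assumption: 'hour' and 'hours' both normalize to 'HOURS'.
    upper = word.upper()
    return upper if upper.endswith("S") else upper + "S"

def parse_lookback(text):
    """Translate a lookback DSL string into the named-parameter map form."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized lookback expression: {text!r}")
    spec = {"binSize": int(match["bin"]), "binUnits": _units(match["bin_units"])}
    if match["from"] is not None:
        spec["from"] = int(match["from"])
        spec["fromUnits"] = _units(match["from_units"])
    if match["to"] is not None:
        spec["to"] = int(match["to"])
        spec["toUnits"] = _units(match["to_units"])
    # Conditions are kept exactly as written (e.g. 'tuesdays', 'holidays').
    for keyword, value in re.findall(r"(including|excluding) (\w+)", match["clauses"]):
        spec.setdefault(keyword, []).append(value)
    return spec
```

Grouping the conditions into separate 'including' and 'excluding' lists, as
the map form does, discards the order in which the clauses were written.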
> > *DSL vs API*
> >
> > So, here's my personal rundown of the two approaches:
> >
> > DSL:
> >
> >    - PRO
> >       - Clear.  As you can see, it reads like a sentence
> >       - Concise
> >    - CON
> >       - More complex to implement
> >       - Another DSL to learn
> >
> > API:
> >
> >    - PRO
> >       - Simpler to implement (though marginally so, IMO)
> >    - CON
> >       - A bit more complex to understand (also, IMO)
> >
> > I'd like to solicit feedback from the community at this point:
> >
> >    - What do you think of this change?
> >    - Would you prefer the DSL, API or other approach?
> >
> > Thanks,
> >
> > Casey
> >
>
