metron-dev mailing list archives

From Michael Miklavcic <michael.miklav...@gmail.com>
Subject Re: [DISCUSS] Expansion of the capabilities of PROFILE_GET
Date Mon, 23 Jan 2017 20:42:27 GMT
Casey,

I think this is a move in the right direction. I am partial to the DSL.
While it is another DSL to learn, I believe it is far easier to understand
and write lookback rules for than it would be using the map of values.

I'd like to suggest that the concept of holiday should be pluggable by the
user. I would also like to make a case for the includes and excludes
prioritization being based on order in the DSL. So in the holiday example
you might be able to say something like PROFILE_LOOKBACK( '1 hour bins from
1 hour to 1 month including tuesdays excluding holidays including newyears')
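
To make the ordering idea concrete, here is a rough Python sketch of one possible reading: clauses are evaluated left to right and the last matching clause wins. The predicate table, the holiday dates, and the function names are all hypothetical and only for illustration; none of this is existing Metron or Stellar code, and a real holiday calendar would be supplied by the user.

```python
import re
from datetime import date

# Hypothetical calendars; in practice these would be pluggable by the user.
NEW_YEARS = {date(2017, 1, 1)}
HOLIDAYS = {date(2016, 12, 25), date(2017, 1, 1)}

# Hypothetical selector names mapped to day predicates (Monday == 0).
PREDICATES = {
    "tuesdays": lambda d: d.weekday() == 1,
    "holidays": lambda d: d in HOLIDAYS,
    "newyears": lambda d: d in NEW_YEARS,
}

def parse_clauses(dsl):
    """Extract (verb, selector) pairs in the order they appear in the string."""
    return re.findall(r"\b(including|excluding)\s+(\w+)", dsl)

def is_selected(day, clauses):
    """Apply clauses in order; the last clause that matches the day wins."""
    # Assumption: with no clauses, or when the first clause is an exclusion,
    # every day starts out selected; otherwise nothing is selected by default.
    selected = not clauses or clauses[0][0] == "excluding"
    for verb, selector in clauses:
        if PREDICATES[selector](day):
            selected = (verb == "including")
    return selected

clauses = parse_clauses(
    "1 hour bins from 1 hour to 1 month "
    "including tuesdays excluding holidays including newyears"
)
```

Under this reading, a Tuesday that is also a holiday is dropped, while New Year's Day is kept because its clause comes last. Whether a later "including" should also pull in days that are not Tuesdays is a design question the sketch does not settle.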

Thanks,
Mike


On Mon, Jan 23, 2017 at 1:01 PM, Casey Stella <cestella@gmail.com> wrote:

> Hi All,
>
> I'm planning to expand the capabilities of PROFILE_GET and wanted to run
> an idea past the community.
>
> *Current State*
>
> Currently, the functionality of PROFILE_GET is fairly straightforward:
>
>    - profile - The name of the profile.
>    - entity - The name of the entity.
>    - durationAgo - How long ago should values be retrieved from?
>    - units - The units of 'durationAgo'.
>    - groups_list - Optional, must correspond to the 'groupBy' list used in
>    profile creation - List (in square brackets) of groupBy values used to
>    filter the profile. Default is the empty list, meaning groupBy was not
>    used when creating the profile.
>    - config_overrides - Optional - Map (in curly braces) of name:value
>    pairs, each overriding the global config parameter of the same name.
>    Default is the empty Map, meaning no overrides.
>
> This has the advantage of providing a relatively simple mechanism to
> support the dominant use-case, gathering the profiles for a trailing
> window.  There are, however, a couple of issues:
>
>    - We may need more complex semantics for specifying the window
>    (motivated below)
>    - As such, this couples the gathering of the profiles with the
>    specification of the window.
>
> I propose to decouple these two concepts. I propose that we extract the
> notion of the lookback into a separate, more featureful function called
> PROFILE_LOOKBACK() which could be composed with an adjusted PROFILE_GET,
> whose arguments look like:
>
>
>    - profile - The name of the profile.
>    - entity - The name of the entity.
>    - timestamps - The list of timestamps to retrieve
>    - groups_list - Optional, must correspond to the 'groupBy' list used in
>    profile creation - List (in square brackets) of groupBy values used to
>    filter the profile. Default is the empty list, meaning groupBy was not
>    used when creating the profile.
>    - config_overrides - Optional - Map (in curly braces) of name:value
>    pairs, each overriding the global config parameter of the same name.
>    Default is the empty Map, meaning no overrides.
>
> So, PROFILE_GET would have the output of PROFILE_LOOKBACK passed to it as
> its 3rd argument (e.g. PROFILE_GET( 'my_profile', 'my_entity',
> PROFILE_LOOKBACK(...)) ).
>
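
The decoupling described above can be sketched in a few lines of Python. This is only an illustration of the shape of the two functions: the in-memory store, the function signatures, and the sample values are assumptions made for the example, not the actual Stellar implementations.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory stand-in for the profile store,
# keyed by (profile, entity, bin start time).
STORE = {
    ("my_profile", "my_entity", datetime(2017, 1, 23, 14)): 10,
    ("my_profile", "my_entity", datetime(2017, 1, 23, 13)): 12,
}

def profile_lookback(now, bin_size, count):
    """Return the start times of the `count` most recent bins before `now`."""
    return [now - bin_size * (i + 1) for i in range(count)]

def profile_get(profile, entity, timestamps):
    """Fetch stored values for the given bins; no window logic lives here."""
    keys = ((profile, entity, ts) for ts in timestamps)
    return [STORE[k] for k in keys if k in STORE]

# The window is computed first, then handed to the getter as its 3rd argument.
window = profile_lookback(datetime(2017, 1, 23, 15), timedelta(hours=1), 3)
values = profile_get("my_profile", "my_entity", window)
```

The point of the split is that a more elaborate window (sparse bins, skipped days) only changes the lookback function; the getter stays the same.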
> *Motivation for Change*
>
> The justification for this is that sometimes you want to compare time bins
> for a long duration back, but you don't want to skew the data by including
> periods that aren't distributionally similar (due to seasonal data, for
> instance).  You might want to compare a value to a statistical baseline,
> such as the median of the values for the same time window on the same day
> of the week over the last month (e.g. every Tuesday at this time).
>
> Also, we might want a trailing window that does not start at the current
> time (in wall-clock), but rather starts an hour back or from the time that
> the data was originally ingested.
>
>
> *PROFILE_LOOKBACK*
>
> I propose that we support the following features:
>
>    - A starting point that is not current time
>    - Sparse bins (i.e. the last hour for every tuesday for the last month)
>    - The ability to skip events (e.g. weekends, holidays)
>
>
> This would result in a new function with the following arguments:
>
>    - from - The lookback starting point (default to now)
>    - fromUnits - The units for the lookback starting point
>    - to - The ending point for the lookback window (default to from +
>    binSize)
>    - toUnits - The units for the lookback ending point
>    - including - A list of conditions which we would include.
>       - weekend
>       - holiday
>       - sunday through saturday
>    - excluding - A list of conditions which we would skip.
>       - weekend
>       - holiday
>       - sunday through saturday
>    - binSize - The size of the lookback bin
>    - binUnits - The units of the lookback bin
>
> Given the number of arguments, their complexity, and the fact that many
> of them are optional, I propose that either
>
>    - PROFILE_LOOKBACK take a Map so that we can get essentially named
>    params in Stellar.
>    - PROFILE_LOOKBACK accept a string backed by a DSL to express these
>    criteria
>
>
> Ok, so that's a lot to take in.  How about we look at some motivating
> use-cases.
>
> *Base Case: A lookback of 1 hour ago*
>
> As a map, this would look like:
>
> PROFILE_LOOKBACK( { 'binSize' : 1, 'binUnits' : 'HOURS' } )
>
> As a DSL this would look like:
> PROFILE_LOOKBACK( '1 hour bins from now')
>
>
> *The same time window every tuesday for the last month starting one hour
> ago*
>
> Just to make this as clear as possible, if this is run at 3PM on Monday
> January 23rd, 2017, it would include the following bins:
>
>    - January 17th, 2PM - 3PM
>    - January 10th, 2PM - 3PM
>    - January 3rd, 2PM - 3PM
>    - December 27th, 2PM - 3PM
>
> As a map, this would look like:
>
> PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1, 'toUnits'
> : 'MONTH', 'including' : [ 'tuesday' ], 'binSize' : 1, 'binUnits' : 'HOURS'
> } )
>
> As a DSL this would look like:
> PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including tuesdays')
>
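
For anyone who wants to check the bin list above, here is a small Python sketch that reproduces it. The function name and parameters are hypothetical, and "1 month" is approximated as 31 days purely for the example; the proposal does not define how months would be measured.

```python
from datetime import datetime, timedelta

TUESDAY = 1  # datetime.weekday(): Monday == 0 ... Sunday == 6

def sparse_bins(now, bin_size, start_offset, lookback, weekdays):
    """Return (start, end) bins, newest first, one per matching day."""
    bins = []
    start = now - start_offset            # the first candidate bin starts here
    while start >= now - lookback:
        if start.weekday() in weekdays:
            bins.append((start, start + bin_size))
        start -= timedelta(days=1)        # same time of day, one day earlier
    return bins

bins = sparse_bins(
    now=datetime(2017, 1, 23, 15, 0),     # 3PM on Monday January 23rd, 2017
    bin_size=timedelta(hours=1),
    start_offset=timedelta(hours=1),      # "from 1 hour"
    lookback=timedelta(days=31),          # "to 1 month", approximated
    weekdays={TUESDAY},
)
for start, end in bins:
    print(start.isoformat(), "->", end.isoformat())
```

With these inputs the sketch yields the four 2PM - 3PM Tuesday bins listed above, from January 17th back to December 27th.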
> *The same time window every sunday for the last month starting one hour ago
> skipping holidays*
>
> Just to make this as clear as possible, if this is run at 3PM on Monday
> January 22nd, 2017, it would include the following bins:
>
>    - January 16th, 2PM - 3PM
>    - January 9th, 2PM - 3PM
>    - January 2nd, 2PM - 3PM
>    - NOT December 25th
>
> As a map, this would look like:
>
> PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1, 'toUnits'
> : 'MONTH', 'including' : [ 'tuesday'], 'excluding' : [ 'holidays' ],
> 'binSize' : 1, 'binUnits' : 'HOURS' } )
>
> As a DSL this would look like:
> PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including tuesdays
> excluding holidays')
>
> *DSL vs API*
>
> So, here's my personal rundown of the two approaches:
>
> DSL:
>
>    - PRO
>       - Clear.  As you can see, it reads like a sentence
>       - Concise
>    - CON
>       - More complex to implement
>       - Another DSL to learn
>
> API:
>
>    - PRO
>       - Simpler to implement (though marginally so, IMO)
>    - CON
>       - A bit more complex to understand (also, IMO)
>
> I'd like to solicit feedback from the community at this point:
>
>    - What do you think of this change?
>    - Would you prefer the DSL, API or other approach?
>
> Thanks,
>
> Casey
>
