metron-dev mailing list archives

From Casey Stella <ceste...@gmail.com>
Subject [DISCUSS] Expansion of the capabilities of PROFILE_GET
Date Mon, 23 Jan 2017 20:01:40 GMT
Hi All,

I'm planning to expand the capabilities of PROFILE_GET and wanted to run
an idea past the community.

*Current State*

Currently, PROFILE_GET is fairly straightforward. It takes the following
arguments:

   - profile - The name of the profile.
   - entity - The name of the entity.
   - durationAgo - How long ago should values be retrieved from?
   - units - The units of 'durationAgo'.
   - groups_list - Optional, must correspond to the 'groupBy' list used in
   profile creation - List (in square brackets) of groupBy values used to
   filter the profile. Default is the empty list, meaning groupBy was not used
   when creating the profile.
   - config_overrides - Optional - Map (in curly braces) of name:value
   pairs, each overriding the global config parameter of the same name.
   Default is the empty Map, meaning no overrides.

This has the advantage of providing a relatively simple mechanism to
support the dominant use-case: gathering the profiles for a trailing
window.  There are, however, a couple of issues:

   - We may need more complex semantics for specifying the window
   (motivated below)
   - The current signature couples the gathering of the profiles with the
   specification of the window, so richer window semantics would
   complicate PROFILE_GET itself.

I propose to decouple these two concepts. I propose that we extract the
notion of the lookback into a separate, more featureful function called
PROFILE_LOOKBACK() which could be composed with an adjusted PROFILE_GET,
whose arguments look like:


   - profile - The name of the profile.
   - entity - The name of the entity.
   - timestamps - The list of timestamps to retrieve
   - groups_list - Optional, must correspond to the 'groupBy' list used in
   profile creation - List (in square brackets) of groupBy values used to
   filter the profile. Default is the empty list, meaning groupBy was not used
   when creating the profile.
   - config_overrides - Optional - Map (in curly braces) of name:value
   pairs, each overriding the global config parameter of the same name.
   Default is the empty Map, meaning no overrides.

So, PROFILE_GET would have the output of PROFILE_LOOKBACK passed to it as
its 3rd argument (e.g. PROFILE_GET( 'my_profile', 'my_entity',
PROFILE_LOOKBACK(...)) ).
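
To make the composition concrete, here is a minimal sketch in Python
(chosen only for illustration; it is not Stellar). The in-memory store, the
15 minute profile period and both function signatures are assumptions made
for this example, not the Profiler's actual implementation.

```python
from datetime import datetime, timedelta

# Assumed profile period; the real value comes from the Profiler config.
PERIOD = timedelta(minutes=15)


def profile_lookback(now, duration):
    """Return the period start times in the trailing window [now - duration, now)."""
    timestamps = []
    t = now - duration
    while t < now:
        timestamps.append(t)
        t += PERIOD
    return timestamps


def profile_get(store, profile, entity, timestamps):
    """Fetch stored values for the given timestamps, skipping missing periods."""
    return [store[(profile, entity, t)]
            for t in timestamps
            if (profile, entity, t) in store]


now = datetime(2017, 1, 23, 15, 0)
# Hypothetical store keyed by (profile, entity, period start).
store = {
    ("my_profile", "my_entity", now - timedelta(minutes=45)): 7,
    ("my_profile", "my_entity", now - timedelta(minutes=15)): 9,
}
values = profile_get(store, "my_profile", "my_entity",
                     profile_lookback(now, timedelta(hours=1)))
```

The point of the sketch is that profile_get no longer knows anything about
windows: any function that produces a list of timestamps can feed it.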

*Motivation for Change*

The justification for this is that sometimes you want to compare time bins
for a long duration back, but you don't want to skew the data by including
periods that aren't distributionally similar (due to seasonal data, for
instance).  You might want to compare a value to a statistical baseline,
such as the median of the values for the same time window on the same day
of the week for the last month (e.g. every Tuesday at this time).

Also, we might want a trailing window that does not start at the current
time (in wall-clock), but rather starts an hour back or from the time that
the data was originally ingested.


*PROFILE_LOOKBACK*

I propose that we support the following features:

   - A starting point that is not current time
   - Sparse bins (e.g. the last hour for every Tuesday for the last month)
   - The ability to skip events (e.g. weekends, holidays)


This would result in a new function with the following arguments:

   - from - The lookback starting point (defaults to now)
   - fromUnits - The units for the lookback starting point
   - to - The ending point for the lookback window (defaults to from +
   binSize)
   - toUnits - The units for the lookback ending point
   - including - A list of conditions which we would include.
      - weekend
      - holiday
      - sunday through saturday
   - excluding - A list of conditions which we would skip.
      - weekend
      - holiday
      - sunday through saturday
   - binSize - The size of the lookback bin
   - binUnits - The units of the lookback bin
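
As a rough illustration of how these arguments and their defaults could fit
together, here is a minimal Python sketch. The field names mirror the
arguments above; the LookbackSpec class, the from_map helper, the use of 0
for "now" and the rejection of unknown keys are all assumptions made for
this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Maps the argument names used in this proposal to Python field names.
KEYS = {
    "from": "from_", "fromUnits": "from_units",
    "to": "to", "toUnits": "to_units",
    "including": "including", "excluding": "excluding",
    "binSize": "bin_size", "binUnits": "bin_units",
}


@dataclass
class LookbackSpec:
    bin_size: int
    bin_units: str
    from_: int = 0                    # 0 stands for "now" (assumed encoding)
    from_units: Optional[str] = None
    to: Optional[int] = None          # None stands for "from + binSize"
    to_units: Optional[str] = None
    including: List[str] = field(default_factory=list)
    excluding: List[str] = field(default_factory=list)

    @classmethod
    def from_map(cls, m):
        """Build a spec from a map of named params, rejecting unknown keys."""
        unknown = set(m) - set(KEYS)
        if unknown:
            raise ValueError("unknown lookback arguments: %s" % sorted(unknown))
        return cls(**{KEYS[k]: v for k, v in m.items()})


spec = LookbackSpec.from_map({"binSize": 1, "binUnits": "HOURS"})
```

Only binSize and binUnits are required in this sketch; everything else
falls back to the defaults described in the list.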

Given the number of arguments, their complexity, and the fact that many of
them are optional, I propose that either

   - PROFILE_LOOKBACK take a Map so that we can get essentially named
   params in Stellar, or
   - PROFILE_LOOKBACK accept a string backed by a DSL to express these
   criteria


Ok, so that's a lot to take in.  How about we look at some motivating
use-cases.

*Base Case: A lookback of 1 hour ago*

As a map, this would look like:

PROFILE_LOOKBACK( { 'binSize' : 1, 'binUnits' : 'HOURS' } )

As a DSL this would look like:
PROFILE_LOOKBACK( '1 hour bins from now')


*The same time window every Tuesday for the last month starting one hour
ago*

Just to make this as clear as possible, if this is run at 3PM on Monday
January 23rd, 2017, it would include the following bins:

   - January 17th, 2PM - 3PM
   - January 10th, 2PM - 3PM
   - January 3rd, 2PM - 3PM
   - December 27th, 2PM - 3PM

As a map, this would look like:

PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1, 'toUnits'
: 'MONTH', 'including' : [ 'tuesday' ], 'binSize' : 1, 'binUnits' : 'HOURS'
} )

As a DSL this would look like:
PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including tuesdays')
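
The bin arithmetic for this example can be checked with a short Python
sketch. The lookback_bins function is hypothetical, and two things are
assumed here: candidate bins step back one day at a time, and "1 month" is
approximated as 31 days.

```python
from datetime import datetime, timedelta

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def lookback_bins(now, from_ago, to_ago, bin_size, including=()):
    """Return (start, end) bins, newest first.

    The newest candidate bin starts at now - from_ago. Earlier candidates
    step back one day at a time and stop once a bin would start before
    now - to_ago. A candidate is kept if 'including' is empty or names the
    bin's day of the week.
    """
    bins = []
    start = now - from_ago
    earliest = now - to_ago
    while start >= earliest:
        if not including or WEEKDAYS[start.weekday()] in including:
            bins.append((start, start + bin_size))
        start -= timedelta(days=1)
    return bins


now = datetime(2017, 1, 23, 15, 0)  # 3PM on Monday January 23rd, 2017
bins = lookback_bins(now,
                     from_ago=timedelta(hours=1),
                     to_ago=timedelta(days=31),   # assumed "1 month"
                     bin_size=timedelta(hours=1),
                     including=["tuesday"])
for start, end in bins:
    print(start.strftime("%B %d, %I%p"), "-", end.strftime("%I%p"))
```

Under those assumptions the sketch yields the four Tuesday bins listed
above: January 17th, 10th and 3rd, and December 27th, each from 2PM to 3PM.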

*The same time window every Sunday for the last month starting one hour ago
skipping holidays*

Just to make this as clear as possible, if this is run at 3PM on Monday
January 23rd, 2017, it would include the following bins:

   - January 22nd, 2PM - 3PM
   - January 15th, 2PM - 3PM
   - January 8th, 2PM - 3PM
   - NOT January 1st (New Year's Day)
   - NOT December 25th (Christmas Day)

As a map, this would look like:

PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1, 'toUnits'
: 'MONTH', 'including' : [ 'sunday' ], 'excluding' : [ 'holidays' ],
'binSize' : 1, 'binUnits' : 'HOURS' } )

As a DSL this would look like:
PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including sundays
excluding holidays')
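
To give a feel for the implementation cost of the DSL option, here is a
minimal Python sketch that turns DSL strings of the shape used in these
examples into the equivalent map. The grammar is inferred only from the
examples in this message, so it is an assumption, as are the parse_lookback
name and the choice to strip a trailing "s" from conditions. Units are
returned as written; normalizing them (e.g. to 'HOURS') is left out.

```python
import re

# Grammar inferred from the examples in this message (an assumption):
#   <n> <unit> bins from (now | <n> <unit>) [to <n> <unit>]
#       [including <cond>[, <cond>...]] [excluding <cond>[, <cond>...]]
DSL = re.compile(
    r"^(?P<bin_size>\d+) (?P<bin_units>\w+) bins from "
    r"(?:now|(?P<from>\d+) (?P<from_units>\w+))"
    r"(?: to (?P<to>\d+) (?P<to_units>\w+))?"
    r"(?: including (?P<including>\w+(?:, ?\w+)*))?"
    r"(?: excluding (?P<excluding>\w+(?:, ?\w+)*))?$"
)


def singular(word):
    """'tuesdays' -> 'tuesday'; an assumed convention, not part of the proposal."""
    return word[:-1] if word.endswith("s") else word


def parse_lookback(text):
    """Parse a lookback DSL string into a map of named params."""
    m = DSL.match(" ".join(text.split()))  # collapse line wraps and extra spaces
    if m is None:
        raise ValueError("unrecognized lookback expression: %r" % text)
    spec = {"binSize": int(m["bin_size"]), "binUnits": m["bin_units"]}
    if m["from"] is not None:
        spec["from"] = int(m["from"])
        spec["fromUnits"] = m["from_units"]
    if m["to"] is not None:
        spec["to"] = int(m["to"])
        spec["toUnits"] = m["to_units"]
    for key in ("including", "excluding"):
        if m[key]:
            spec[key] = [singular(w) for w in re.split(r", ?", m[key])]
    return spec


base = parse_lookback("1 hour bins from now")
full = parse_lookback(
    "1 hour bins from 1 hour to 1 month including tuesdays excluding holidays")
```

A single regular expression covers these examples, but a real grammar
(relative dates, nested conditions, better error messages) would likely
need a proper parser.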

*DSL vs API*

So, here's my personal rundown of the two approaches:

DSL:

   - PRO
      - Clear.  As you can see, it reads like a sentence
      - Concise
   - CON
      - More complex to implement
      - Another DSL to learn

API:

   - PRO
      - Simpler to implement (though marginally so, IMO)
   - CON
      - A bit more complex to understand (also, IMO)

I'd like to solicit feedback from the community at this point:

   - What do you think of this change?
   - Would you prefer the DSL, API or other approach?

Thanks,

Casey
