metron-dev mailing list archives

From James Sirota <jsir...@apache.org>
Subject Re: [DISCUSS] Expansion of the capabilities of PROFILE_GET
Date Mon, 23 Jan 2017 20:58:40 GMT
We used jollyday for OpenSOC for holiday resolution
http://jollyday.sourceforge.net/

I think it's Apache licensed.

23.01.2017, 13:42, "Michael Miklavcic" <michael.miklavcic@gmail.com>:
> Casey,
>
> I think this is a move in the right direction. I am partial to the DSL.
> While it is another DSL to learn, I believe it is far easier to understand
> and write lookback rules for than it would be using the map of values.
>
> I'd like to suggest that the concept of holiday should be pluggable by the
> user. I would also like to make a case for the includes and excludes
> prioritization being based on order in the DSL. So in the holiday example
> you might be able to say something like PROFILE_LOOKBACK( '1 hour bins from
> 1 hour to 1 month including tuesdays excluding holidays including newyears')
>
> Thanks,
> Mike
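
One possible reading of the ordering suggestion above is that clauses are evaluated left to right and the last matching clause wins. The Python sketch below illustrates that reading only. The names `is_selected`, `is_tuesday`, `is_holiday`, `is_new_year` and the `HOLIDAYS` sample set are hypothetical and are not part of Metron or Stellar. Passing predicates as plain callables is one way the notion of a holiday could stay pluggable.

```python
from datetime import date

def is_selected(day, rules):
    """Apply (action, predicate) rules in order; the last matching rule wins.

    A day that matches no rule at all is not selected.
    """
    selected = False
    for action, predicate in rules:
        if predicate(day):
            selected = (action == "including")
    return selected

# Illustrative sample data only: a user-supplied holiday calendar.
HOLIDAYS = {date(2016, 12, 25), date(2017, 1, 1), date(2017, 7, 4)}

def is_tuesday(day):
    return day.weekday() == 1  # date.weekday(): Monday is 0

def is_holiday(day):
    return day in HOLIDAYS

def is_new_year(day):
    return (day.month, day.day) == (1, 1)

# Assumed equivalent of: 'including tuesdays excluding holidays including newyears'
rules = [
    ("including", is_tuesday),
    ("excluding", is_holiday),
    ("including", is_new_year),
]

ordinary_tuesday = is_selected(date(2017, 1, 3), rules)
holiday_tuesday = is_selected(date(2017, 7, 4), rules)
new_years_day = is_selected(date(2017, 1, 1), rules)
```

Under this reading an ordinary Tuesday is kept, a Tuesday that is also a holiday is dropped, and New Year's Day is kept because the final `including` clause overrides the earlier `excluding` one. A first-match-wins rule would be an equally valid choice and would give different results.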
>
> On Mon, Jan 23, 2017 at 1:01 PM, Casey Stella <cestella@gmail.com> wrote:
>
>>  Hi All,
>>
>>  I'm planning to expand the capabilities of PROFILE_GET and wanted to pass
>>  an idea past the community.
>>
>>  *Current State*
>>
>>  Currently, the functionality of PROFILE_GET is fairly straightforward:
>>
>>     - profile - The name of the profile.
>>     - entity - The name of the entity.
>>     - durationAgo - How long ago should values be retrieved from?
>>     - units - The units of 'durationAgo'.
>>     - groups_list - Optional, must correspond to the 'groupBy' list used in
>>     profile creation - List (in square brackets) of groupBy values used to
>>     filter the profile. Default is the empty list, meaning groupBy was not
>>     used when creating the profile.
>>     - config_overrides - Optional - Map (in curly braces) of name:value
>>     pairs, each overriding the global config parameter of the same name.
>>     Default is the empty Map, meaning no overrides.
>>
>>  This has the advantage of providing a relatively simple mechanism to
>>  support the dominant use-case, gathering the profiles for a trailing
>>  window. There are, however, a couple of issues:
>>
>>     - We may need more complex semantics for specifying the window
>>     (motivated below)
>>     - As such, this couples the gathering of the profiles with the
>>     specification of the window.
>>
>>  I propose to decouple these two concepts. I propose that we extract the
>>  notion of the lookback into a separate, more featureful function called
>>  PROFILE_LOOKBACK() which could be composed with an adjusted PROFILE_GET,
>>  whose arguments look like:
>>
>>     - profile - The name of the profile.
>>     - entity - The name of the entity.
>>     - timestamps - The list of timestamps to retrieve
>>     - groups_list - Optional, must correspond to the 'groupBy' list used in
>>     profile creation - List (in square brackets) of groupBy values used to
>>     filter the profile. Default is the empty list, meaning groupBy was not
>>     used when creating the profile.
>>     - config_overrides - Optional - Map (in curly braces) of name:value
>>     pairs, each overriding the global config parameter of the same name.
>>     Default is the empty Map, meaning no overrides.
>>
>>  So, PROFILE_GET would have the output of PROFILE_LOOKBACK passed to it as
>>  its 3rd argument (e.g. PROFILE_GET( 'my_profile', 'my_entity',
>>  PROFILE_LOOKBACK(...)) ).
>>
>>  *Motivation for Change*
>>
>>  The justification for this is that sometimes you want to compare time bins
>>  for a long duration back, but you don't want to skew the data by including
>>  periods that aren't distributionally similar (due to seasonal data, for
>>  instance). You might want to compare a value to a statistical baseline,
>>  such as the median of the values for the same time window on the same day
>>  for the last month (e.g. every Tuesday at this time).
>>
>>  Also, we might want a trailing window that does not start at the current
>>  time (in wall-clock), but rather starts an hour back or from the time that
>>  the data was originally ingested.
>>
>>  *PROFILE_LOOKBACK*
>>
>>  I propose that we support the following features:
>>
>>     - A starting point that is not current time
>>     - Sparse bins (i.e. the last hour for every tuesday for the last month)
>>     - The ability to skip events (e.g. weekends, holidays)
>>
>>  This would result in a new function with the following arguments:
>>
>>     - from - The lookback starting point (default to now)
>>     - fromUnits - The units for the lookback starting point
>>     - to - The ending point for the lookback window (default to from +
>>     binSize)
>>     - toUnits - The units for the lookback ending point
>>     - including - A list of conditions which we would include.
>>        - weekend
>>        - holiday
>>        - sunday through saturday
>>     - excluding - A list of conditions which we would skip.
>>        - weekend
>>        - holiday
>>        - sunday through saturday
>>     - binSize - The size of the lookback bin
>>     - binUnits - The units of the lookback bin
>>
>>  Given the number of arguments and their complexity and the fact that many,
>>  many are optional, I propose that either
>>
>>     - PROFILE_LOOKBACK take a Map so that we can get essentially named
>>     params in stellar.
>>     - PROFILE_LOOKBACK accept a string backed by a DSL to express these
>>     criteria
>>
>>  Ok, so that's a lot to take in. How about we look at some motivating
>>  use-cases.
>>
>>  *Base Case: A lookback of 1 hour ago*
>>
>>  As a map, this would look like:
>>
>>  PROFILE_LOOKBACK( { 'binSize' : 1, 'binUnits' : 'HOURS' } )
>>
>>  As a DSL this would look like:
>>  PROFILE_LOOKBACK( '1 hour bins from now')
>>
>>  *The same time window every tuesday for the last month starting one hour
>>  ago*
>>
>>  Just to make this as clear as possible, if this is run at 3PM on Monday
>>  January 23rd, 2017, it would include the following bins:
>>
>>     - January 17th, 2PM - 3PM
>>     - January 10th, 2PM - 3PM
>>     - January 3rd, 2PM - 3PM
>>     - December 27th, 2PM - 3PM
>>
>>  As a map, this would look like:
>>
>>  PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1, 'toUnits'
>>  : 'MONTH', 'including' : [ 'tuesday' ], 'binSize' : 1, 'binUnits' : 'HOURS'
>>  } )
>>
>>  As a DSL this would look like:
>>  PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including tuesdays')
>>
>>  *The same time window every sunday for the last month starting one hour ago
>>  skipping holidays*
>>
>>  Just to make this as clear as possible, if this is run at 3PM on Monday
>>  January 22nd, 2017, it would include the following bins:
>>
>>     - January 16th, 2PM - 3PM
>>     - January 9th, 2PM - 3PM
>>     - January 2nd, 2PM - 3PM
>>     - NOT December 25th
>>
>>  As a map, this would look like:
>>
>>  PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1, 'toUnits'
>>  : 'MONTH', 'including' : [ 'tuesday'], 'excluding' : [ 'holidays' ],
>>  'binSize' : 1, 'binUnits' : 'HOURS' } )
>>
>>  As a DSL this would look like:
>>  PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including tuesdays
>>  excluding holidays')
>>
>>  *DSL vs API*
>>
>>  So, here's my personal rundown of the two approaches:
>>
>>  DSL:
>>
>>     - PRO
>>        - Clear. As you can see, it reads like a sentence
>>        - Concise
>>     - CON
>>        - More complex to implement
>>        - Another DSL to learn
>>
>>  API:
>>
>>     - PRO
>>        - Simpler to implement (though marginally so, IMO)
>>     - CON
>>        - A bit more complex to understand (also, IMO)
>>
>>  I'd like to solicit feedback from the community at this point:
>>
>>     - What do you think of this change?
>>     - Would you prefer the DSL, API or other approach?
>>
>>  Thanks,
>>
>>  Casey

------------------- 
Thank you,

James Sirota
PPMC- Apache Metron (Incubating)
jsirota AT apache DOT org
