metron-dev mailing list archives

From Nick Allen <n...@nickallen.org>
Subject Re: [DISCUSS] Expansion of the capabilities of PROFILE_GET
Date Tue, 31 Jan 2017 18:04:42 GMT
>
>
> I share your concern at making another DSL, but
> cron seemed to not be a complete solution and its syntax, despite being
> well known by admins, may not be well known to analysts.  Also, and this is
> just a personal bias, I find it inscrutable without a fair amount of
> wikipedia and man page reading.


Totally valid concerns.  I am thinking the pros outweigh the cons though
because...

   1. At least some percentage of our user base already knows it
   2. For those that don't know it, there already exists a ton of
   documentation on it
   3. Covers a ton of corner cases that we are not thinking about.  For
   example, last Friday of the month, the nearest weekday to the 15th of the
   month.
   4. It has well-known semantics proven over many years
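
To make the corner cases in point 3 concrete, here is a minimal Python sketch
of what those two rules compute. The helper names are made up for illustration
and are not part of cron or Metron; Quartz-style cron dialects express the same
rules with the `L` and `W` modifiers.

```python
import calendar
from datetime import date, timedelta

FRIDAY = 4  # date.weekday() numbers Monday as 0


def last_friday(year, month):
    """Return the last Friday of the given month (hypothetical helper)."""
    days_in_month = calendar.monthrange(year, month)[1]
    last_day = date(year, month, days_in_month)
    # Step back from the end of the month to the closest Friday.
    return last_day - timedelta(days=(last_day.weekday() - FRIDAY) % 7)


def nearest_weekday(day):
    """Return the Monday-Friday date nearest to `day` (hypothetical helper).

    Assumes `day` is not at a month boundary, which holds for the 15th.
    """
    if day.weekday() == 5:  # Saturday -> preceding Friday
        return day - timedelta(days=1)
    if day.weekday() == 6:  # Sunday -> following Monday
        return day + timedelta(days=1)
    return day


print(last_friday(2017, 1))
print(nearest_weekday(date(2017, 1, 15)))
```

Neither rule can be written as a fixed day-of-month or day-of-week value, which
is why a hand-rolled DSL would need to grow special cases for them.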





On Tue, Jan 31, 2017 at 12:00 PM, Casey Stella <cestella@gmail.com> wrote:

> I actually did consider cron initially but dismissed it for the following
> reasons:
>
>    - Cron syntax allows you to construct only absolute lookbacks (i.e.
>    "every tuesday at 3PM" not "every tuesday at the current hour")
>    - Cron syntax allows you to specify a point in time, not a duration.  We
>    could, of course, specify a duration as another argument
>    - Cron syntax does not allow you to skip things like holidays, etc.
>
> You could use Cron syntax as part of a broader API to specify the days to
> look back and have other arguments handle the aspects that cron doesn't
> support out of the box.  I share your concern at making another DSL, but
> cron seemed to not be a complete solution and its syntax, despite being
> well known by admins, may not be well known to analysts.  Also, and this is
> just a personal bias, I find it inscrutable without a fair amount of
> wikipedia and man page reading.
>
> On Tue, Jan 31, 2017 at 11:47 AM, Nick Allen <nick@nickallen.org> wrote:
>
> > I do prefer the flexibility of the DSL, but would prefer not to create yet
> > another DSL for our users to learn.  Couldn't we somehow use cron
> > expressions for this functionality?
> >
> > On Mon, Jan 23, 2017 at 3:01 PM, Casey Stella <cestella@gmail.com> wrote:
> >
> > > Hi All,
> > >
> > > I'm planning to expand the capabilities of PROFILE_GET and wanted to run
> > > an idea past the community.
> > >
> > > *Current State*
> > >
> > > Currently, the functionality of PROFILE_GET is fairly straightforward:
> > >
> > >    - profile - The name of the profile.
> > >    - entity - The name of the entity.
> > >    - durationAgo - How long ago should values be retrieved from?
> > >    - units - The units of 'durationAgo'.
> > >    - groups_list - Optional, must correspond to the 'groupBy' list used in
> > >    profile creation - List (in square brackets) of groupBy values used to
> > >    filter the profile. Default is the empty list, meaning groupBy was not
> > >    used when creating the profile.
> > >    - config_overrides - Optional - Map (in curly braces) of name:value
> > >    pairs, each overriding the global config parameter of the same name.
> > >    Default is the empty Map, meaning no overrides.
> > >
> > > This has the advantage of providing a relatively simple mechanism to
> > > support the dominant use-case, gathering the profiles for a trailing
> > > window.  There are, however, a couple of issues:
> > >
> > >    - We may need more complex semantics for specifying the window
> > >    (motivated below)
> > >    - As such, this couples the gathering of the profiles with the
> > >    specification of the window.
> > >
> > > I propose to decouple these two concepts. I propose that we extract the
> > > notion of the lookback into a separate, more featureful function called
> > > PROFILE_LOOKBACK() which could be composed with an adjusted PROFILE_GET,
> > > whose arguments look like:
> > >
> > >
> > >    - profile - The name of the profile.
> > >    - entity - The name of the entity.
> > >    - timestamps - The list of timestamps to retrieve
> > >    - groups_list - Optional, must correspond to the 'groupBy' list used in
> > >    profile creation - List (in square brackets) of groupBy values used to
> > >    filter the profile. Default is the empty list, meaning groupBy was not
> > >    used when creating the profile.
> > >    - config_overrides - Optional - Map (in curly braces) of name:value
> > >    pairs, each overriding the global config parameter of the same name.
> > >    Default is the empty Map, meaning no overrides.
> > >
> > > So, PROFILE_GET would have the output of PROFILE_LOOKBACK passed to it as
> > > its 3rd argument (e.g. PROFILE_GET( 'my_profile', 'my_entity',
> > > PROFILE_LOOKBACK(...)) ).
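
The composition described above can be sketched in a few lines of Python. This
is only an illustration of the decoupling, not Metron code: the in-memory
`store`, the integer minute timestamps, and both function bodies are
assumptions made for the example.

```python
def profile_lookback(now, duration, period):
    """Hypothetical stand-in: one timestamp per period in a trailing window."""
    count = duration // period
    return [now - i * period for i in range(1, count + 1)]


def profile_get(store, profile, entity, timestamps):
    """Return the stored values for whichever requested timestamps exist."""
    keys = ((profile, entity, ts) for ts in timestamps)
    return [store[key] for key in keys if key in store]


# Toy profile store: one value every 15 minutes, keyed by minute offset.
store = {("my_profile", "my_entity", t): t * 10 for t in range(0, 100, 15)}

values = profile_get(
    store, "my_profile", "my_entity",
    profile_lookback(now=90, duration=60, period=15),
)
print(values)
```

Because `profile_get` only sees a list of timestamps, any other way of
producing that list (sparse bins, skipped days) can be swapped in without
touching the retrieval side.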
> > >
> > > *Motivation for Change*
> > >
> > > The justification for this is that sometimes you want to compare time bins
> > > for a long duration back, but you don't want to skew the data by including
> > > periods that aren't distributionally similar (due to seasonal data, for
> > > instance).  You might want to compare a value to a statistical baseline,
> > > such as the median of the values for the same time window on the same day
> > > for the last month (e.g. every tuesday at this time).
> > >
> > > Also, we might want a trailing window that does not start at the current
> > > time (in wall-clock), but rather starts an hour back or from the time that
> > > the data was originally ingested.
> > >
> > >
> > > *PROFILE_LOOKBACK*
> > >
> > > I propose that we support the following features:
> > >
> > >    - A starting point that is not current time
> > >    - Sparse bins (i.e. the last hour for every tuesday for the last month)
> > >    - The ability to skip events (e.g. weekends, holidays)
> > >
> > >
> > > This would result in a new function with the following arguments:
> > >
> > >    - from - The lookback starting point (default to now)
> > >    - fromUnits - The units for the lookback starting point
> > >    - to - The ending point for the lookback window (default to from +
> > >    binSize)
> > >    - toUnits - The units for the lookback ending point
> > >    - including - A list of conditions which we would include.
> > >       - weekend
> > >       - holiday
> > >       - sunday through saturday
> > >    - excluding - A list of conditions which we would skip.
> > >       - weekend
> > >       - holiday
> > >       - sunday through saturday
> > >    - binSize - The size of the lookback bin
> > >    - binUnits - The units of the lookback bin
> > >
> > > Given the number of arguments and their complexity and the fact that many,
> > > many are optional, I propose that either
> > >
> > >    - PROFILE_LOOKBACK take a Map so that we can get essentially named
> > >    params in stellar.
> > >    - PROFILE_LOOKBACK accept a string backed by a DSL to express these
> > >    criteria
> > >
> > >
> > > Ok, so that's a lot to take in.  How about we look at some motivating
> > > use-cases.
> > >
> > > *Base Case: A lookback of 1 hour ago*
> > >
> > > As a map, this would look like:
> > >
> > > PROFILE_LOOKBACK( { 'binSize' : 1, 'binUnits' : 'HOURS' } )
> > >
> > > As a DSL this would look like:
> > > PROFILE_LOOKBACK( '1 hour bins from now')
> > >
> > >
> > > *The same time window every tuesday for the last month starting one hour
> > > ago*
> > >
> > > Just to make this as clear as possible, if this is run at 3PM on Monday
> > > January 23rd, 2017, it would include the following bins:
> > >
> > >    - January 17th, 2PM - 3PM
> > >    - January 10th, 2PM - 3PM
> > >    - January 3rd, 2PM - 3PM
> > >    - December 27th, 2PM - 3PM
> > >
> > > As a map, this would look like:
> > >
> > > PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1,
> > > 'toUnits' : 'MONTH', 'including' : [ 'tuesday' ], 'binSize' : 1,
> > > 'binUnits' : 'HOURS' } )
> > >
> > > As a DSL this would look like:
> > > PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including tuesdays')
> > >
> > > *The same time window every sunday for the last month starting one hour
> > > ago skipping holidays*
> > >
> > > Just to make this as clear as possible, if this is run at 3PM on Monday
> > > January 23rd, 2017, it would include the following bins:
> > >
> > >    - January 16th, 2PM - 3PM
> > >    - January 9th, 2PM - 3PM
> > >    - January 2nd, 2PM - 3PM
> > >    - NOT December 25th
> > >
> > > As a map, this would look like:
> > >
> > > PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1,
> > > 'toUnits' : 'MONTH', 'including' : [ 'sunday' ], 'excluding' : [ 'holidays' ],
> > > 'binSize' : 1, 'binUnits' : 'HOURS' } )
> > >
> > > As a DSL this would look like:
> > > PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including sundays
> > > excluding holidays')
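
To show how the DSL strings and the Map form relate, here is a rough Python
sketch that parses the sentences used in these examples into the named
parameters. The grammar is inferred from the three examples only, and
`parse_lookback`, the regular expression, and the normalisation of units to
upper-case plurals are assumptions rather than a proposed specification.

```python
import re

# Assumed grammar, inferred from the examples in this thread:
#   <n> <unit> bins from (now | <n> <unit>) [to <n> <unit>]
#       [including <conditions>] [excluding <conditions>]
LOOKBACK = re.compile(
    r"^\s*(?P<binSize>\d+)\s+(?P<binUnits>\w+)\s+bins?\s+from\s+"
    r"(?:now|(?P<from>\d+)\s+(?P<fromUnits>\w+))"
    r"(?:\s+to\s+(?P<to>\d+)\s+(?P<toUnits>\w+))?"
    r"(?:\s+including\s+(?P<including>.+?))?"
    r"(?:\s+excluding\s+(?P<excluding>.+?))?\s*$",
    re.IGNORECASE,
)


def _unit(text):
    """Normalise 'hour' or 'hours' to 'HOURS' (assumed convention)."""
    text = text.upper()
    return text if text.endswith("S") else text + "S"


def _conditions(text):
    """Split 'tuesdays, holidays' into ['tuesday', 'holiday']."""
    words = [w for w in re.split(r"[\s,]+", text.lower()) if w and w != "and"]
    return [w[:-1] if w.endswith("s") else w for w in words]


def parse_lookback(dsl):
    """Turn a lookback sentence into a dict of named parameters."""
    match = LOOKBACK.match(dsl)
    if match is None:
        raise ValueError("unrecognised lookback expression: %r" % dsl)
    params = {}
    for size, units in (("binSize", "binUnits"), ("from", "fromUnits"),
                        ("to", "toUnits")):
        if match.group(size) is not None:
            params[size] = int(match.group(size))
            params[units] = _unit(match.group(units))
    for key in ("including", "excluding"):
        if match.group(key) is not None:
            params[key] = _conditions(match.group(key))
    return params


print(parse_lookback("1 hour bins from now"))
print(parse_lookback(
    "1 hour bins from 1 hour to 1 month including tuesdays excluding holidays"
))
```

Even this small grammar needs decisions about plurals, separators, and unit
spelling, which gives a feel for the "more complex to implement" cost of the
DSL option.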
> > >
> > > *DSL vs API*
> > >
> > > So, here's my personal rundown of the two approaches:
> > >
> > > DSL:
> > >
> > >    - PRO
> > >       - Clear.  As you can see, it reads like a sentence
> > >       - Concise
> > >    - CON:
> > >       - More complex to implement
> > >       - Another DSL to learn
> > >
> > > API:
> > >
> > >    - PRO
> > >       - Simpler to implement (though marginally so, IMO)
> > >    - CON
> > >       - A bit more complex to understand (also, IMO)
> > >
> > > I'd like to solicit feedback from the community at this point:
> > >
> > >    - What do you think of this change?
> > >    - Would you prefer the DSL, API or other approach?
> > >
> > > Thanks,
> > >
> > > Casey
> > >
> >
> >
> >
> > --
> > Nick Allen <nick@nickallen.org>
> >
>



-- 
Nick Allen <nick@nickallen.org>
