metron-dev mailing list archives

From Casey Stella <ceste...@gmail.com>
Subject Re: [DISCUSS] Expansion of the capabilities of PROFILE_GET
Date Tue, 31 Jan 2017 19:23:02 GMT
1. Yes, in my mind I was imagining a formal grammar specified by antlr,
like we use in stellar with all the accompanying tooling that comes from an
antlr grammar.

2 and 3 are worth pondering.  I'll reserve comment until I've pondered. ;)


On Tue, Jan 31, 2017 at 2:19 PM, Matt Foley <mattf@apache.org> wrote:

> Casey, this sounds great.
>
> 1. I think you have a model for how to do the natural-language-look DSL,
> that maybe isn’t clear to the rest of us.  Does your contemplated approach
> meet the following?
> a) Can it be specified as a formal grammar, not too complex, so anyone who
> bothers can visually parse a sentence and see that it is conforming (or
> not)? – Obviously I’m concerned that we not bite off a full NLP problem for
> the sake of ease of use.
> b) Can it dump a human-understandable parse tree on request, so if it does
> the unexpected (or just fails to work), the user can easily figure out what
> the program did and why?
> c) And you already said it wouldn’t be much more complex to implement than
> using a Map parameter list.
>
> Given those 3 things, I’d vote for the NL, otherwise the parameter list.
>
> 2. One more thing I’d recommend for PROFILE_LOOKBACK():  wildcarding of
> Groups.
> - Easy to implement for groups based on small finite sets like day-of-week.
> - Also easy, although perhaps slow, for large finite sets and
> multi-dimensional group sub-keys.
> - Not clear how to do for indeterminate sets from random Stellar
> functions, but we should talk to HBase gurus and see what we can do, using
> HBase’s scan capability.
> - If group sub-keys are not already at the end of the row key structure,
> we may want to move them there, to make the latter case easier and more
> efficient.
>
> 3. There’s another thing I’ve been noodling, that deserves its own
> discussion, but is relevant to mention here – or if you really like it,
> could be included in PROFILE_LOOKBACK project, as it’s not a difficult
> thing:
> Currently the Profiler config settings, at the time a Profile is run, get
> burned into the HBase row keys, such that if you don’t _a priori_ know
> those config settings, you can’t read the Profile.  I’d like to suggest we
> start saving Profile metadata, also in HBase, every time a new Profile is
> started (and, if possible, ended) so that I can say “Get that old profile
> from November 14-21, whatever its metadata was” without needing to know its
> period, groups, etc.  Obviously it would be nice to be able to query the
> metadata itself, too, but just having the metadata gives us the right start.
>
> --Matt
>
> On 1/31/17, 10:43 AM, "Casey Stella" <cestella@gmail.com> wrote:
>
>     Regarding the "?" syntax:
>     Wouldn't that be forking cron syntax so now we have a metron cron?  If
>     we're constructing our own syntax, then why not do it so that it reads like
>     natural language?
>
>     Regarding the holiday problem:
>     Agreed, it's a smaller problem than constructing a DSL, but that's not
>     really the point, I think. The concern is that it would be unable to be
>     expressed using cron syntax in a natural way without modifying cron syntax,
>     which would be constructing a new DSL.  If quartz has a clever way of doing
>     that, then I'd like to see it.  From a quick search, I haven't seen a
>     scheduling example with a compact syntax that shows skipping holidays with
>     cron syntax.
>
>
>     On Tue, Jan 31, 2017 at 1:29 PM, Nick Allen <nick@nickallen.org>
> wrote:
>
>     > >
>     > >    - Cron syntax allows you to construct only absolute lookbacks
> (i.e.
>     > >    "every tuesday at 3PM" not "every tuesday at the current hour")
>     >
>     >
>     > I think Cron would work for this.  I am no expert on cron
> expressions, but
>     > I think the following examples would work.
>     >
>     >    - If you want "every Tuesday at 3 PM"
>     >       - 0 0 15 ? * TUE *
>     >    - If you want "every Tuesday at current hour" then use something like
>     >    the "?" placeholder maybe.
>     >       - 0 0 ? ? * TUE *
>     >
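
For reference, the two expressions above follow the Quartz field order: seconds, minutes, hours, day-of-month, month, day-of-week, and an optional year. The Python sketch below only labels the fields and flags any "?" that sits outside the two day fields, which are the only places stock Quartz accepts it. The helper names `label_fields` and `misplaced_question_marks` are hypothetical and belong to neither Quartz nor Metron.

```python
# Quartz-style cron fields, in order; the seventh (year) field is optional.
QUARTZ_FIELDS = ["seconds", "minutes", "hours", "day_of_month",
                 "month", "day_of_week", "year"]

# In Quartz, "?" ("no specific value") is accepted only in these two fields.
QUESTION_MARK_ALLOWED = {"day_of_month", "day_of_week"}


def label_fields(expression):
    """Pair each whitespace-separated token with its Quartz field name."""
    tokens = expression.split()
    if len(tokens) not in (6, 7):
        raise ValueError("expected 6 or 7 fields, got %d" % len(tokens))
    return dict(zip(QUARTZ_FIELDS, tokens))


def misplaced_question_marks(expression):
    """Return the fields where "?" appears but stock Quartz would reject it."""
    return [name for name, token in label_fields(expression).items()
            if token == "?" and name not in QUESTION_MARK_ALLOWED]


print(label_fields("0 0 15 ? * TUE *"))
print(misplaced_question_marks("0 0 15 ? * TUE *"))  # no misplaced "?"
print(misplaced_question_marks("0 0 ? ? * TUE *"))   # "?" in the hours field
```

Under these assumptions, using "?" to mean "the current hour" would be an extension to the syntax rather than something an existing Quartz parser would accept.
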
>     > - Cron syntax allows you to specify a point in time, not a
> duration.  We
>     > >    could, of course, specify a duration as another argument
>     >
>     >
>     > Yes, a separate argument would be necessary.  We would have to allow
> the
>     > user to specify either a "start from date/time" or the "number of
> intervals
>     > to look back".
>     >
>     > Cron syntax does not allow you to skip things like holidays, etc.
>     >
>     >
>     > I agree, out-of-the-box Cron does not solve holiday calendars.  But
> this
>     > would be a smaller problem to solve than creating our own DSL.
>     >
>     > There is a tradition of creating shortcuts that look something like
> @Daily
>     > or @Weekdays or @Tuesdays that we could also use to make things
> easier for
>     > users.
>     >
>     > I have used Quartz with cron expressions in the past and there was
> some way
>     > to handle holidays with that.  I think you could create a custom
> calendar
>     > for the holidays and call it something; aka @USHolidays.  And then
> you
>     > would say "every Tuesday" except @USHolidays or something like
> that.  I'd
>     > have to look into this some more.
>     >
>     > And there are also nice online Cron expression "translators" that we
> could
>     > mimic in a Metron user interface.  For example, https://crontab.guru
> .
>     >
>     >
>     >
>     >
>     > On Tue, Jan 31, 2017 at 12:00 PM, Casey Stella <cestella@gmail.com>
> wrote:
>     >
>     > > I actually did consider cron initially but dismissed it for the
> following
>     > > reasons:
>     > >
>     > >    - Cron syntax allows you to construct only absolute lookbacks
> (i.e.
>     > >    "every tuesday at 3PM" not "every tuesday at the current hour")
>     > >    - Cron syntax allows you to specify a point in time, not a
> duration.
>     > We
>     > >    could, of course, specify a duration as another argument
>     > >    - Cron syntax does not allow you to skip things like holidays,
> etc.
>     > >
>     > > You could use Cron syntax as part of a broader API to specify the
> days to
>     > > look back and have other arguments handle the aspects that cron
> doesn't
>     > > support out of the box.  I share your concern at making another
> DSL, but
>     > > cron seemed to not be a complete solution and its syntax, despite
> being
>     > > well known by admins, may not be well known to analysts.  Also,
> and this
>     > is
>     > > just a personal bias, I find it inscrutable without a fair amount
> of
>     > > wikipedia and man page reading.
>     > >
>     > > On Tue, Jan 31, 2017 at 11:47 AM, Nick Allen <nick@nickallen.org>
> wrote:
>     > >
>     > > > I do prefer the flexibility of the DSL, but would prefer not to
> create
>     > > yet
>     > > > another DSL for our users to learn.  Couldn't we somehow use cron
>     > > > expressions for this functionality?
>     > > >
>     > > > On Mon, Jan 23, 2017 at 3:01 PM, Casey Stella <
> cestella@gmail.com>
>     > > wrote:
>     > > >
>     > > > > Hi All,
>     > > > >
>     > > > > I'm planning to expand the capabilities of PROFILE_GET and
> wanted to
>     > > pass
>     > > > > an idea past the community.
>     > > > >
>     > > > > *Current State*
>     > > > >
>     > > > > Currently, the functionality of PROFILE_GET is fairly
>     > straightforward:
>     > > > >
>     > > > >    - profile - The name of the profile.
>     > > > >    - entity - The name of the entity.
>     > > > >    - durationAgo - How long ago should values be retrieved
> from?
>     > > > >    - units - The units of 'durationAgo'.
>     > > > >    - groups_list - Optional, must correspond to the 'groupBy'
> list
>     > used
>     > > > in
>     > > > >    profile creation - List (in square brackets) of groupBy
> values
>     > used
>     > > to
>     > > > >    filter the profile. Default is the empty list, meaning
> groupBy was
>     > > not
>     > > > > used
>     > > > >    when creating the profile.
>     > > > >    - config_overrides - Optional - Map (in curly braces) of
>     > name:value
>     > > > >    pairs, each overriding the global config parameter of the
> same
>     > name.
>     > > > >    Default is the empty Map, meaning no overrides.
>     > > > >
>     > > > > This has the advantage of providing a relatively simple
> mechanism to
>     > > > > support the dominant use-case, gathering the profiles for a
> trailing
>     > > > > window.  The issues, however, are a couple:
>     > > > >
>     > > > >    - We may need more complex semantics for specifying the
> window
>     > > > >    (motivated below)
>     > > > >    - As such, this couples the gathering of the profiles with
> the
>     > > > >    specification of the window.
>     > > > >
>     > > > > I propose to decouple these two concepts. I propose that we
> extract
>     > the
>     > > > > notion of the lookback into a separate, more featureful
> function
>     > called
>     > > > > PROFILE_LOOKBACK() which could be composed with an adjusted
>     > > PROFILE_GET,
>     > > > > whose arguments look like:
>     > > > >
>     > > > >
>     > > > >    - profile - The name of the profile.
>     > > > >    - entity - The name of the entity.
>     > > > >    - timestamps - The list of timestamps to retrieve
>     > > > >    - groups_list - Optional, must correspond to the 'groupBy'
> list
>     > used
>     > > > in
>     > > > >    profile creation - List (in square brackets) of groupBy
> values
>     > used
>     > > to
>     > > > >    filter the profile. Default is the empty list, meaning
> groupBy was
>     > > not
>     > > > > used
>     > > > >    when creating the profile.
>     > > > >    - config_overrides - Optional - Map (in curly braces) of
>     > name:value
>     > > > >    pairs, each overriding the global config parameter of the
> same
>     > name.
>     > > > >    Default is the empty Map, meaning no overrides.
>     > > > >
>     > > > > So, PROFILE_GET would have the output of PROFILE_LOOKBACK
> passed to
>     > it
>     > > as
>     > > > > its 3rd argument (e.g. PROFILE_GET( 'my_profile', 'my_entity',
>     > > > > PROFILE_LOOKBACK(...)) ).
>     > > > >
>     > > > > *Motivation for Change*
>     > > > >
>     > > > > The justification for this is that sometimes you want to
> compare time
>     > > > bins
>     > > > > for a long duration back, but you don't want to skew the data
> by
>     > > > including
>     > > > > periods that aren't distributionally similar (due to seasonal
> data,
>     > for
>     > > > > instance).  You might want to compare a value to a statistical
>     > > > > baseline of
>     > > > > the median of the values for the same time window on the same
> day for
>     > > the
>     > > > > last month (e.g. every tuesday at this time).
>     > > > >
>     > > > > Also, we might want a trailing window that does not start at
> the
>     > > current
>     > > > > time (in wall-clock), but rather starts an hour back or from
> the time
>     > > > that
>     > > > > the data was originally ingested.
>     > > > >
>     > > > >
>     > > > > *PROFILE_LOOKBACK*
>     > > > >
>     > > > > I propose that we support the following features:
>     > > > >
>     > > > >    - A starting point that is not current time
>     > > > >    - Sparse bins (i.e. the last hour for every tuesday for the
> last
>     > > > month)
>     > > > >    - The ability to skip events (e.g. weekends, holidays)
>     > > > >
>     > > > >
>     > > > > This would result in a new function with the following
> arguments:
>     > > > >
>     > > > >    - from - The lookback starting point (default to now)
>     > > > >    - fromUnits - The units for the lookback starting point
>     > > > >    - to - The ending point for the lookback window (default to from +
>     > > > >    binSize)
>     > > > >    - toUnits - The units for the lookback ending point
>     > > > >    - including - A list of conditions which we would include.
>     > > > >       - weekend
>     > > > >       - holiday
>     > > > >       - sunday through saturday
>     > > > >    - excluding - A list of conditions which we would skip.
>     > > > >       - weekend
>     > > > >       - holiday
>     > > > >       - sunday through saturday
>     > > > >    - binSize - The size of the lookback bin
>     > > > >    - binUnits - The units of the lookback bin
>     > > > >
>     > > > > Given the number of arguments and their complexity and the
> fact that
>     > > > many,
>     > > > > many are optional, I propose that either
>     > > > >
>     > > > >    - PROFILE_LOOKBACK take a Map so that we can get
> essentially named
>     > > > >    params in stellar.
>     > > > >    - PROFILE_LOOKBACK accept a string backed by a DSL to
> express
>     > these
>     > > > >    criteria
>     > > > >
>     > > > >
>     > > > > Ok, so that's a lot to take in.  How about we look at some
> motivating
>     > > > > use-cases.
>     > > > >
>     > > > > *Base Case: A lookback of 1 hour ago*
>     > > > >
>     > > > > As a map, this would look like:
>     > > > >
>     > > > > PROFILE_LOOKBACK( { 'binSize' : 1, 'binUnits' : 'HOURS' } )
>     > > > >
>     > > > > As a DSL this would look like:
>     > > > > PROFILE_LOOKBACK( '1 hour bins from now')
>     > > > >
>     > > > >
>     > > > > *The same time window every tuesday for the last month
> starting one
>     > > hour
>     > > > > ago*
>     > > > >
>     > > > > Just to make this as clear as possible, if this is run at 3PM
> on
>     > Monday
>     > > > > January 23rd, 2017, it would include the following bins:
>     > > > >
>     > > > >    - January 17th, 2PM - 3PM
>     > > > >    - January 10th, 2PM - 3PM
>     > > > >    - January 3rd, 2PM - 3PM
>     > > > >    - December 27th, 2PM - 3PM
>     > > > >
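
The bins listed above can be reproduced with a short Python sketch. It is only an illustration of the proposed semantics, not Metron code: `lookback_bins` and its parameters are hypothetical names, a "month" is assumed to be 30 days, and holidays are passed in as an explicit set of dates because no holiday calendar has been specified in this thread.

```python
from datetime import datetime, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def lookback_bins(now, from_ago, to_ago, bin_size,
                  including=None, excluding_dates=frozenset()):
    """Return (start, end) bins, newest first.

    The window starts `from_ago` before `now` and is repeated at the same
    time of day on each earlier day, going back no further than `to_ago`.
    """
    bins = []
    start = now - from_ago
    oldest = now - to_ago
    while start >= oldest:
        day_name = WEEKDAYS[start.weekday()]
        day_ok = including is None or day_name in including
        if day_ok and start.date() not in excluding_dates:
            bins.append((start, start + bin_size))
        start -= timedelta(days=1)  # same time window, previous day
    return bins


# Run at 3PM on Monday, January 23rd, 2017, as in the example above.
now = datetime(2017, 1, 23, 15, 0)
for start, end in lookback_bins(now,
                                from_ago=timedelta(hours=1),
                                to_ago=timedelta(days=30),  # assumed "1 month"
                                bin_size=timedelta(hours=1),
                                including={"tuesday"}):
    print(start, "-", end)
```

Passing a date through `excluding_dates` drops that day's bin, which is how the "skipping holidays" variant would be expressed under the same assumptions.
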
>     > > > > As a map, this would look like:
>     > > > >
>     > > > > PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1,
>     > > > >   'toUnits' : 'MONTH', 'including' : [ 'tuesday' ], 'binSize' : 1,
>     > > > >   'binUnits' : 'HOURS' } )
>     > > > >
>     > > > > As a DSL this would look like:
>     > > > > PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including tuesdays')
>     > > > >
>     > > > > *The same time window every sunday for the last month starting
> one
>     > hour
>     > > > ago
>     > > > > skipping holidays*
>     > > > >
>     > > > > Just to make this as clear as possible, if this is run at 3PM
> on
>     > Monday
>     > > > > January 22nd, 2017, it would include the following bins:
>     > > > >
>     > > > >    - January 16th, 2PM - 3PM
>     > > > >    - January 9th, 2PM - 3PM
>     > > > >    - January 2nd, 2PM - 3PM
>     > > > >    - NOT December 25th
>     > > > >
>     > > > > As a map, this would look like:
>     > > > >
>     > > > > PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1,
>     > > > >   'toUnits' : 'MONTH', 'including' : [ 'tuesday'], 'excluding' : [ 'holidays' ],
>     > > > >   'binSize' : 1, 'binUnits' : 'HOURS' } )
>     > > > >
>     > > > > As a DSL this would look like:
>     > > > > PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including tuesdays excluding holidays')
>     > > > >
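
To make the relationship between the two forms concrete, here is a rough Python sketch that turns a DSL sentence into the equivalent Map. It uses a single regular expression as a stand-in for a real grammar (the thread suggests ANTLR; this is not that), and it covers only the sentence shapes shown in the examples. `parse_lookback` is a hypothetical name, and the choice to normalize units to upper-case plurals such as 'HOURS' is an assumption.

```python
import re

# One pattern for: "<n> <unit> bins from (now | <n> <unit>) [to <n> <unit>]
#                   [including a, b] [excluding c, d]"
DSL = re.compile(
    r"""^\s*(?P<binSize>\d+)\s+(?P<binUnits>[a-z]+?)s?\s+bins
        \s+from\s+(?:now|(?P<from>\d+)\s+(?P<fromUnits>[a-z]+?)s?)
        (?:\s+to\s+(?P<to>\d+)\s+(?P<toUnits>[a-z]+?)s?)?
        (?:\s+including\s+(?P<including>[a-z]+(?:\s*,\s*[a-z]+)*))?
        (?:\s+excluding\s+(?P<excluding>[a-z]+(?:\s*,\s*[a-z]+)*))?
        \s*$""",
    re.VERBOSE | re.IGNORECASE,
)


def parse_lookback(sentence):
    """Translate a lookback sentence into a dict of named parameters."""
    match = DSL.match(sentence)
    if match is None:
        raise ValueError("not a valid lookback sentence: %r" % sentence)
    parts = match.groupdict()
    spec = {}
    for size_key, units_key in (("binSize", "binUnits"),
                                ("from", "fromUnits"),
                                ("to", "toUnits")):
        if parts[size_key] is not None:
            spec[size_key] = int(parts[size_key])
            # Assumed normalization: singular unit -> upper-case plural.
            spec[units_key] = parts[units_key].upper() + "S"
    for list_key in ("including", "excluding"):
        if parts[list_key] is not None:
            spec[list_key] = [w.strip().lower()
                              for w in parts[list_key].split(",")]
    return spec


print(parse_lookback("1 hour bins from now"))
print(parse_lookback("1 hour bins from 1 hour to 1 month "
                     "including tuesdays excluding holidays"))
```

A sentence that does not fit the pattern raises `ValueError` instead of being guessed at, which keeps the language small enough to check by eye.
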
>     > > > > *DSL vs API*
>     > > > >
>     > > > > So, here's my personal rundown of the two approaches:
>     > > > >
>     > > > > DSL:
>     > > > >
>     > > > >    - PRO
>     > > > >       - Clear.  As you can see, it reads like a sentence
>     > > > >       - Concise
>     > > > >    - CON
>     > > > >       - More complex to implement
>     > > > >       - Another DSL to learn
>     > > > >
>     > > > > API:
>     > > > >
>     > > > >    - PRO
>     > > > >       - Simpler to implement (though marginally so, IMO)
>     > > > >    - CON
>     > > > >       - A bit more complex to understand (also, IMO)
>     > > > >
>     > > > > I'd like to solicit feedback from the community at this point:
>     > > > >
>     > > > >    - What do you think of this change?
>     > > > >    - Would you prefer the DSL, API or other approach?
>     > > > >
>     > > > > Thanks,
>     > > > >
>     > > > > Casey
>     > > > >
>     > > >
>     > > >
>     > > >
>     > > > --
>     > > > Nick Allen <nick@nickallen.org>
>     > > >
>     > >
>     >
>     >
>     >
>     > --
>     > Nick Allen <nick@nickallen.org>
>     >
>
>
>
>
>
