metron-dev mailing list archives

From Matt Foley <ma...@apache.org>
Subject Re: [DISCUSS] Expansion of the capabilities of PROFILE_GET
Date Tue, 31 Jan 2017 19:19:46 GMT
Casey, this sounds great.  

1. I think you have a model for how to do the natural-language-look DSL, that maybe isn’t
clear to the rest of us.  Does your contemplated approach meet the following?
a) Can it be specified as a formal grammar, not too complex, so anyone who bothers can visually
parse a sentence and see that it is conforming (or not)? – Obviously I’m concerned that
we not bite off a full NLP problem for the sake of ease of use.
b) Can it dump a human-understandable parse tree on request, so if it does the unexpected
(or just fails to work), the user can easily figure out what the program did and why?
c) And you already said it wouldn’t be much more complex to implement than using a Map parameter
list.

Given those 3 things, I’d vote for the NL, otherwise the parameter list.
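
To make (a) and (b) concrete, here is a rough Python sketch of what I mean by a small formal grammar plus a parse-tree dump. The grammar itself, the names `parse_lookback` and `dump_tree`, and the tree shape are all assumptions for illustration, inferred from the example sentences in Casey's proposal; this is not his design or anything that exists in Metron.

    sentence := duration "bins" "from" start ["to" duration]
                ["including" names] ["excluding" names]
    start    := "now" | duration
    duration := INT UNIT
    names    := WORD+

```python
import re

UNITS = {"minute", "hour", "day", "week", "month"}  # assumed unit vocabulary
KEYWORDS = {"bins", "from", "to", "including", "excluding"}


def _unit(word):
    """Normalize 'hours' -> 'hour'; reject anything that is not a known unit."""
    unit = word[:-1] if word.endswith("s") else word
    if unit not in UNITS:
        raise ValueError(f"unknown time unit: {word!r}")
    return unit


def _duration(tokens, pos):
    """duration := INT UNIT"""
    if pos + 1 >= len(tokens) or not tokens[pos].isdigit():
        raise ValueError(f"expected '<number> <unit>' at token {pos}")
    return ("duration", int(tokens[pos]), _unit(tokens[pos + 1])), pos + 2


def _expect(tokens, pos, word):
    if pos >= len(tokens) or tokens[pos] != word:
        raise ValueError(f"expected {word!r} at token {pos}")
    return pos + 1


def _names(tokens, pos):
    """names := WORD+ (stops at the next keyword or at end of input)"""
    names = []
    while pos < len(tokens) and tokens[pos] not in KEYWORDS:
        names.append(tokens[pos])
        pos += 1
    if not names:
        raise ValueError(f"expected at least one name at token {pos}")
    return names, pos


def parse_lookback(sentence):
    """Parse one lookback sentence into a nested tuple/list tree, or raise ValueError."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    bin_size, pos = _duration(tokens, 0)
    pos = _expect(tokens, pos, "bins")
    pos = _expect(tokens, pos, "from")
    if pos < len(tokens) and tokens[pos] == "now":
        start, pos = ("now",), pos + 1
    else:
        start, pos = _duration(tokens, pos)
    tree = [("bins", bin_size), ("from", start)]
    if pos < len(tokens) and tokens[pos] == "to":
        end, pos = _duration(tokens, pos + 1)
        tree.append(("to", end))
    for clause in ("including", "excluding"):
        if pos < len(tokens) and tokens[pos] == clause:
            names, pos = _names(tokens, pos + 1)
            tree.append((clause, names))
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r} at position {pos}")
    return ("lookback", tree)


def dump_tree(node, indent=0):
    """Render the tree one node per line so a user can see what was understood."""
    pad = "  " * indent
    if isinstance(node, tuple):
        head, *rest = node
        lines = [f"{pad}{head}"]
        for child in rest:
            lines.extend(dump_tree(child, indent + 1))
        return lines
    if isinstance(node, list):
        lines = []
        for child in node:
            lines.extend(dump_tree(child, indent))
        return lines
    return [f"{pad}{node}"]
```

Calling `print("\n".join(dump_tree(parse_lookback(s))))` on a sentence gives the indented dump. Anything outside the grammar raises a `ValueError` that names the offending token rather than being guessed at, which is the behavior I'm after.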

2. One more thing I’d recommend for PROFILE_LOOKBACK():  wildcarding of Groups.  
- Easy to implement for groups based on small finite sets like day-of-week.
- Also easy, although perhaps slow, for large finite sets and multi-dimensional group sub-keys.

- Not clear how to do for indeterminate sets from random Stellar functions, but we should
talk to HBase gurus and see what we can do, using HBase’s scan capability.
- If group sub-keys are not already at the end of the row key structure, we may want to move
them there, to make the latter case easier and more efficient.
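
A rough Python sketch of the two wildcard cases, under loud assumptions: the `profile|entity|period|groups...` string layout below is hypothetical and is not Metron's actual row key format, and a sorted Python list stands in for HBase's sorted key space. It only illustrates why having the group sub-keys last turns a wildcard into a simple prefix scan.

```python
from bisect import bisect_left
from itertools import product

SEP = "|"  # hypothetical separator; the real row keys are laid out differently


def row_key(profile, entity, period, groups):
    """Build an illustrative row key with the group sub-keys at the end."""
    return SEP.join([profile, entity, f"{period:010d}", *groups])


def expand_wildcards(groups, domains):
    """Finite sets: replace each '*' with every value from its known domain.

    `domains` maps a group position to the list of values it can take.
    """
    choices = [domains[i] if g == "*" else [g] for i, g in enumerate(groups)]
    return [list(combo) for combo in product(*choices)]


def prefix_scan(sorted_keys, prefix):
    """Indeterminate sets: seek to the prefix, then read while keys still match."""
    i = bisect_left(sorted_keys, prefix)
    hits = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        hits.append(sorted_keys[i])
        i += 1
    return hits
```

With this layout, "all groups for period 42" is `prefix_scan(keys, row_key('p', 'e', 42, []) + SEP)`. If the group sub-keys sat in the middle of the key instead, the same query would need a filter over a much wider range.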

3. There’s another thing I’ve been noodling, that deserves its own discussion, but is
relevant to mention here – or if you really like it, could be included in the PROFILE_LOOKBACK
project, as it’s not a difficult thing:
Currently the Profiler config settings, at the time a Profile is run, get burned into the
HBase row keys, such that if you don’t _a priori_ know those config settings, you can’t
read the Profile.  I’d like to suggest we start saving Profile metadata, also in HBase,
every time a new Profile is started (and, if possible, ended) so that I can say “Get that
old profile from November 14-21, whatever its metadata was” without needing to know its
period, groups, etc.  Obviously it would be nice to be able to query the metadata itself,
too, but just having the metadata gives us the right start.
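
A minimal Python sketch of the kind of metadata record and lookup I have in mind. Every name here (`ProfileMetadata`, its fields, `find_runs`) is hypothetical, and an in-memory list stands in for whatever HBase table would actually hold the records.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class ProfileMetadata:
    # Hypothetical fields standing in for whatever settings end up
    # encoded in the row key (period, groups, and so on).
    profile: str
    period_minutes: int
    group_by: tuple
    started: datetime
    ended: Optional[datetime] = None  # None means the profile is still running


def find_runs(records: List[ProfileMetadata], profile: str,
              window_start: datetime, window_end: datetime) -> List[ProfileMetadata]:
    """Return every recorded run of `profile` that overlaps the requested window."""
    hits = []
    for rec in records:
        if rec.profile != profile:
            continue
        rec_end = rec.ended or datetime.max
        # Two intervals overlap when each one starts before the other ends.
        if rec.started < window_end and rec_end > window_start:
            hits.append(rec)
    return sorted(hits, key=lambda r: r.started)
```

The caller asks only for a profile name and a date range, gets back the period and groups that were in force at the time, and can then build the right row keys without knowing those settings in advance.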

--Matt

On 1/31/17, 10:43 AM, "Casey Stella" <cestella@gmail.com> wrote:

    Regarding the "?" syntax:
    Wouldn't that be forking cron syntax so now we have a metron cron?  If
    we're constructing our own syntax, then why not do it so that it reads like
    natural language?
    
    Regarding the holiday problem:
    Agreed, it's a smaller problem than constructing a DSL, but that's not
    really the point, I think. The concern is that it would be unable to be
    expressed using cron syntax in a natural way without modifying cron syntax,
    which would be constructing a new DSL.  If quartz has a clever way of doing
    that, then I'd like to see it.  From a quick search, I haven't seen a
    scheduling example with a compact syntax that shows skipping holidays with
    cron syntax.
    
    
    On Tue, Jan 31, 2017 at 1:29 PM, Nick Allen <nick@nickallen.org> wrote:
    
    > >
    > >    - Cron syntax allows you to construct only absolute lookbacks (i.e.
    > >    "every tuesday at 3PM" not "every tuesday at the current hour")
    >
    >
    > I think Cron would work for this.  I am no expert on cron expressions, but
    > I think the following examples would work.
    >
    >    - If you want "every Tuesday at 3 PM"
    >       - 0 0 15 ? * TUE *
    >    - If you want "every Tuesday at current hour" then use something like
    >    the "?" placeholder maybe.
    >       - 0 0 ? ? * TUE *
    >
    > >    - Cron syntax allows you to specify a point in time, not a duration.  We
    > >    could, of course, specify a duration as another argument
    >
    >
    > Yes, a separate argument would be necessary.  We would have to allow the
    > user to specify either a "start from date/time" or the "number of intervals
    > to look back".
    >
    > >    - Cron syntax does not allow you to skip things like holidays, etc.
    >
    >
    > I agree, out-of-the-box Cron does not solve holiday calendars.  But this
    > would be a smaller problem to solve than creating our own DSL.
    >
    > There is a tradition of creating shortcuts that look something like @Daily
    > or @Weekdays or @Tuesdays that we could also use to make things easier for
    > users.
    >
    > I have used Quartz with cron expressions in the past and there was some way
    > to handle holidays with that.  I think you could create a custom calendar
    > for the holidays and call it something; aka @USHolidays.  And then you
    > would say "every Tuesday" except @USHolidays or something like that.  I'd
    > have to look into this some more.
    >
    > And there are also nice online Cron expression "translators" that we could
    > mimic in a Metron user interface.  For example, https://crontab.guru.
    >
    >
    >
    >
    > On Tue, Jan 31, 2017 at 12:00 PM, Casey Stella <cestella@gmail.com> wrote:
    >
    > > I actually did consider cron initially but dismissed it for the following
    > > reasons:
    > >
    > >    - Cron syntax allows you to construct only absolute lookbacks (i.e.
    > >    "every tuesday at 3PM" not "every tuesday at the current hour")
    > >    - Cron syntax allows you to specify a point in time, not a duration.
    > >    We could, of course, specify a duration as another argument
    > >    - Cron syntax does not allow you to skip things like holidays, etc.
    > >
    > > You could use Cron syntax as part of a broader API to specify the days to
    > > look back and have other arguments handle the aspects that cron doesn't
    > > support out of the box.  I share your concern at making another DSL, but
    > > cron seemed to not be a complete solution and its syntax, despite being
    > > well known by admins, may not be well known to analysts.  Also, and this
    > > is just a personal bias, I find it inscrutable without a fair amount of
    > > Wikipedia and man page reading.
    > >
    > > On Tue, Jan 31, 2017 at 11:47 AM, Nick Allen <nick@nickallen.org> wrote:
    > >
    > > > I do prefer the flexibility of the DSL, but would prefer not to create
    > > > yet another DSL for our users to learn.  Couldn't we somehow use cron
    > > > expressions for this functionality?
    > > >
    > > > On Mon, Jan 23, 2017 at 3:01 PM, Casey Stella <cestella@gmail.com> wrote:
    > > >
    > > > > Hi All,
    > > > >
    > > > > I'm planning to expand the capabilities of PROFILE_GET and wanted to
    > > > > pass an idea past the community.
    > > > >
    > > > > *Current State*
    > > > >
    > > > > Currently, the functionality of PROFILE_GET is fairly straightforward:
    > > > >
    > > > >    - profile - The name of the profile.
    > > > >    - entity - The name of the entity.
    > > > >    - durationAgo - How long ago should values be retrieved from?
    > > > >    - units - The units of 'durationAgo'.
    > > > >    - groups_list - Optional, must correspond to the 'groupBy' list used
    > > > >    in profile creation - List (in square brackets) of groupBy values
    > > > >    used to filter the profile. Default is the empty list, meaning
    > > > >    groupBy was not used when creating the profile.
    > > > >    - config_overrides - Optional - Map (in curly braces) of name:value
    > > > >    pairs, each overriding the global config parameter of the same name.
    > > > >    Default is the empty Map, meaning no overrides.
    > > > >
    > > > > This has the advantage of providing a relatively simple mechanism to
    > > > > support the dominant use-case, gathering the profiles for a trailing
    > > > > window.  There are, however, a couple of issues:
    > > > >
    > > > >    - We may need more complex semantics for specifying the window
    > > > >    (motivated below)
    > > > >    - As such, this couples the gathering of the profiles with the
    > > > >    specification of the window.
    > > > >
    > > > > I propose to decouple these two concepts. I propose that we extract the
    > > > > notion of the lookback into a separate, more featureful function called
    > > > > PROFILE_LOOKBACK() which could be composed with an adjusted PROFILE_GET,
    > > > > whose arguments look like:
    > > > >
    > > > >    - profile - The name of the profile.
    > > > >    - entity - The name of the entity.
    > > > >    - timestamps - The list of timestamps to retrieve
    > > > >    - groups_list - Optional, must correspond to the 'groupBy' list used
    > > > >    in profile creation - List (in square brackets) of groupBy values
    > > > >    used to filter the profile. Default is the empty list, meaning
    > > > >    groupBy was not used when creating the profile.
    > > > >    - config_overrides - Optional - Map (in curly braces) of name:value
    > > > >    pairs, each overriding the global config parameter of the same name.
    > > > >    Default is the empty Map, meaning no overrides.
    > > > >
    > > > > So, PROFILE_GET would have the output of PROFILE_LOOKBACK passed to it
    > > > > as its 3rd argument (e.g. PROFILE_GET( 'my_profile', 'my_entity',
    > > > > PROFILE_LOOKBACK(...)) ).
    > > > >
    > > > > *Motivation for Change*
    > > > >
    > > > > The justification for this is that sometimes you want to compare time
    > > > > bins for a long duration back, but you don't want to skew the data by
    > > > > including periods that aren't distributionally similar (due to seasonal
    > > > > data, for instance).  You might want to compare a value to a statistical
    > > > > baseline of the median of the values for the same time window on the
    > > > > same day for the last month (e.g. every Tuesday at this time).
    > > > >
    > > > > Also, we might want a trailing window that does not start at the
    > > > > current time (in wall-clock), but rather starts an hour back or from
    > > > > the time that the data was originally ingested.
    > > > >
    > > > >
    > > > > *PROFILE_LOOKBACK*
    > > > >
    > > > > I propose that we support the following features:
    > > > >
    > > > >    - A starting point that is not current time
    > > > >    - Sparse bins (i.e. the last hour for every tuesday for the last month)
    > > > >    - The ability to skip events (e.g. weekends, holidays)
    > > > >
    > > > >
    > > > > This would result in a new function with the following arguments:
    > > > >
    > > > >    - from - The lookback starting point (default to now)
    > > > >    - fromUnits - The units for the lookback starting point
    > > > >    - to - The ending point for the lookback window (default to from +
    > > > >    binSize)
    > > > >    - toUnits - The units for the lookback ending point
    > > > >    - including - A list of conditions which we would include.
    > > > >       - weekend
    > > > >       - holiday
    > > > >       - sunday through saturday
    > > > >    - excluding - A list of conditions which we would skip.
    > > > >       - weekend
    > > > >       - holiday
    > > > >       - sunday through saturday
    > > > >    - binSize - The size of the lookback bin
    > > > >    - binUnits - The units of the lookback bin
    > > > >
    > > > > Given the number of arguments and their complexity and the fact that
    > > > > many, many are optional, I propose that either
    > > > >
    > > > >    - PROFILE_LOOKBACK take a Map so that we can get essentially named
    > > > >    params in Stellar.
    > > > >    - PROFILE_LOOKBACK accept a string backed by a DSL to express these
    > > > >    criteria
    > > > >
    > > > >
    > > > > Ok, so that's a lot to take in.  How about we look at some motivating
    > > > > use-cases.
    > > > >
    > > > > *Base Case: A lookback of 1 hour ago*
    > > > >
    > > > > As a map, this would look like:
    > > > >
    > > > > PROFILE_LOOKBACK( { 'binSize' : 1, 'binUnits' : 'HOURS' } )
    > > > >
    > > > > As a DSL this would look like:
    > > > > PROFILE_LOOKBACK( '1 hour bins from now')
    > > > >
    > > > >
    > > > > *The same time window every tuesday for the last month starting one
    > > > > hour ago*
    > > > >
    > > > > Just to make this as clear as possible, if this is run at 3PM on Monday
    > > > > January 23rd, 2017, it would include the following bins:
    > > > >
    > > > >    - January 17th, 2PM - 3PM
    > > > >    - January 10th, 2PM - 3PM
    > > > >    - January 3rd, 2PM - 3PM
    > > > >    - December 27th, 2PM - 3PM
    > > > >
    > > > > As a map, this would look like:
    > > > >
    > > > > PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1,
    > > > > 'toUnits' : 'MONTH', 'including' : [ 'tuesday' ], 'binSize' : 1,
    > > > > 'binUnits' : 'HOURS' } )
    > > > >
    > > > > As a DSL this would look like:
    > > > > PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including tuesdays')
    > > > >
    > > > > *The same time window every sunday for the last month starting one hour
    > > > > ago skipping holidays*
    > > > >
    > > > > Just to make this as clear as possible, if this is run at 3PM on Monday
    > > > > January 22nd, 2017, it would include the following bins:
    > > > >
    > > > >    - January 16th, 2PM - 3PM
    > > > >    - January 9th, 2PM - 3PM
    > > > >    - January 2nd, 2PM - 3PM
    > > > >    - NOT December 25th
    > > > >
    > > > > As a map, this would look like:
    > > > >
    > > > > PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1,
    > > > > 'toUnits' : 'MONTH', 'including' : [ 'tuesday' ], 'excluding' : [ 'holidays' ],
    > > > > 'binSize' : 1, 'binUnits' : 'HOURS' } )
    > > > >
    > > > > As a DSL this would look like:
    > > > > PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including tuesdays excluding holidays')
    > > > >
    > > > > *DSL vs API*
    > > > >
    > > > > So, here's my personal rundown of the two approaches:
    > > > >
    > > > > DSL:
    > > > >
    > > > >    - PRO
    > > > >       - Clear.  As you can see, it reads like a sentence
    > > > >       - Concise
    > > > >    - CON
    > > > >       - More complex to implement
    > > > >       - Another DSL to learn
    > > > >
    > > > > API:
    > > > >
    > > > >    - PRO
    > > > >       - Simpler to implement (though marginally so, IMO)
    > > > >    - CON
    > > > >       - A bit more complex to understand (also, IMO)
    > > > >
    > > > > I'd like to solicit feedback from the community at this point:
    > > > >
    > > > >    - What do you think of this change?
    > > > >    - Would you prefer the DSL, API or other approach?
    > > > >
    > > > > Thanks,
    > > > >
    > > > > Casey
    > > > >
    > > >
    > > >
    > > >
    > > > --
    > > > Nick Allen <nick@nickallen.org>
    > > >
    > >
    >
    >
    >
    > --
    > Nick Allen <nick@nickallen.org>
    >
    



