metron-dev mailing list archives

From Nick Allen <n...@nickallen.org>
Subject Re: [DISCUSS] Expansion of the capabilities of PROFILE_GET
Date Tue, 31 Jan 2017 19:13:38 GMT
We would not build any code that knows how to interpret cron expressions.
We are way too lazy for that.  We would use another library that already
knows Cron, such as Quartz.

On Tue, Jan 31, 2017 at 2:07 PM, James Sirota <jsirota@apache.org> wrote:

> I am not sure we need to incorporate the cron.  Whatever scheduler we use
> to trigger the profile_get/lookup should have the cron function built in.
> I think that's out of scope for us.  I think what you defined in your
> original post at the start of this thread is exactly what we need.
>
> 31.01.2017, 12:02, "Casey Stella" <cestella@gmail.com>:
> > One more point: one of the reasons for decoupling PROFILE_GET from
> > PROFILE_LOOKUP is that we could have alternative implementations of
> > PROFILE_LOOKUP. We could have a PROFILE_LOOKUP_CRON as well.
> >
> > On Tue, Jan 31, 2017 at 1:43 PM, Casey Stella <cestella@gmail.com> wrote:
> >
> >>  Regarding the "?" syntax:
> >>  Wouldn't that be forking cron syntax so now we have a metron cron? If
> >>  we're constructing our own syntax, then why not do it so that it reads
> >>  like natural language?
> >>
> >>  Regarding the holiday problem:
> >>  Agreed, it's a smaller problem than constructing a DSL, but that's not
> >>  really the point, I think. The concern is that it would be unable to be
> >>  expressed using cron syntax in a natural way without modifying cron
> >>  syntax, which would be constructing a new DSL. If quartz has a clever
> >>  way of doing that, then I'd like to see it. From a quick search, I
> >>  haven't seen a scheduling example with a compact syntax that shows
> >>  skipping holidays with cron syntax.
> >>
> >>  On Tue, Jan 31, 2017 at 1:29 PM, Nick Allen <nick@nickallen.org> wrote:
> >>
> >>>  >
> >>>  > - Cron syntax allows you to construct only absolute lookbacks (i.e.
> >>>  > "every tuesday at 3PM" not "every tuesday at the current hour")
> >>>
> >>>  I think Cron would work for this. I am no expert on cron expressions,
> >>>  but I think the following examples would work.
> >>>
> >>>     - If you want "every Tuesday at 3 PM"
> >>>        - 0 0 15 ? * TUE *
> >>>     - If you want "every Tuesday at current hour" then use something
> >>>     like the "?" placeholder maybe.
> >>>        - 0 0 ? ? * TUE *
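
As a rough illustration of the two expressions above, the sketch below labels each field of a Quartz-style cron expression (seconds, minutes, hours, day-of-month, month, day-of-week, optional year). The helper name `label_cron_fields` is hypothetical and is not part of Quartz or Metron; it only splits and names the fields and does no validation or scheduling. Stock Quartz accepts "?" only in the day-of-month and day-of-week fields, so a "?" in the hours field would be the extension floated here, not existing behavior.

```python
# Field order used by Quartz cron expressions; the year field is optional.
QUARTZ_FIELDS = (
    "seconds", "minutes", "hours",
    "day_of_month", "month", "day_of_week", "year",
)


def label_cron_fields(expression):
    """Split a Quartz-style cron expression and name each field.

    Hypothetical helper for illustration only: it does not check that
    the individual field values are legal.
    """
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError("expected 6 or 7 whitespace-separated fields")
    return dict(zip(QUARTZ_FIELDS, parts))


# "every Tuesday at 3 PM": the hour is pinned to 15.
absolute = label_cron_fields("0 0 15 ? * TUE *")

# "every Tuesday at current hour": the hour is left open with "?",
# which is the proposed (non-standard) use of the placeholder.
relative = label_cron_fields("0 0 ? ? * TUE *")

print(absolute["hours"], relative["hours"])
```

The only difference between the two expressions is the hours field, which is where an absolute lookback and a "current hour" lookback would diverge.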
> >>>
> >>>  > - Cron syntax allows you to specify a point in time, not a duration.
> >>>  > We could, of course, specify a duration as another argument
> >>>
> >>>  Yes, a separate argument would be necessary. We would have to allow
> >>>  the user to specify either a "start from date/time" or the "number of
> >>>  intervals to look back".
> >>>
> >>>  > - Cron syntax does not allow you to skip things like holidays, etc.
> >>>
> >>>  I agree, out-of-the-box Cron does not solve holiday calendars. But
> >>>  this would be a smaller problem to solve than creating our own DSL.
> >>>
> >>>  There is a tradition of creating shortcuts that look something like
> >>>  @Daily or @Weekdays or @Tuesdays that we could also use to make things
> >>>  easier for users.
> >>>
> >>>  I have used Quartz with cron expressions in the past and there was
> >>>  some way to handle holidays with that. I think you could create a
> >>>  custom calendar for the holidays and call it something like
> >>>  @USHolidays. And then you would say "every Tuesday" except @USHolidays
> >>>  or something like that. I'd have to look into this some more.
> >>>
> >>>  And there are also nice online Cron expression "translators" that we
> >>>  could mimic in a Metron user interface. For example,
> >>>  https://crontab.guru.
> >>>
> >>>  On Tue, Jan 31, 2017 at 12:00 PM, Casey Stella <cestella@gmail.com> wrote:
> >>>
> >>>  > I actually did consider cron initially but dismissed it for the
> >>>  > following reasons:
> >>>  >
> >>>  > - Cron syntax allows you to construct only absolute lookbacks (i.e.
> >>>  > "every tuesday at 3PM" not "every tuesday at the current hour")
> >>>  > - Cron syntax allows you to specify a point in time, not a
> >>>  > duration. We could, of course, specify a duration as another argument
> >>>  > - Cron syntax does not allow you to skip things like holidays, etc.
> >>>  >
> >>>  > You could use Cron syntax as part of a broader API to specify the
> >>>  > days to look back and have other arguments handle the aspects that
> >>>  > cron doesn't support out of the box. I share your concern at making
> >>>  > another DSL, but cron seemed to not be a complete solution and its
> >>>  > syntax, despite being well known by admins, may not be well known to
> >>>  > analysts. Also, and this is just a personal bias, I find it
> >>>  > inscrutable without a fair amount of wikipedia and man page reading.
> >>>  >
> >>>  > On Tue, Jan 31, 2017 at 11:47 AM, Nick Allen <nick@nickallen.org> wrote:
> >>>  >
> >>>  > > I do prefer the flexibility of the DSL, but would prefer not to
> >>>  > > create yet another DSL for our users to learn. Couldn't we somehow
> >>>  > > use cron expressions for this functionality?
> >>>  > >
> >>>  > > On Mon, Jan 23, 2017 at 3:01 PM, Casey Stella <cestella@gmail.com> wrote:
> >>>  > >
> >>>  > > > Hi All,
> >>>  > > >
> >>>  > > > I'm planning to expand the capabilities of PROFILE_GET and wanted
> >>>  > > > to pass an idea past the community.
> >>>  > > >
> >>>  > > > *Current State*
> >>>  > > >
> >>>  > > > Currently, the functionality of PROFILE_GET is fairly
> >>>  > > > straightforward:
> >>>  > > >
> >>>  > > > - profile - The name of the profile.
> >>>  > > > - entity - The name of the entity.
> >>>  > > > - durationAgo - How long ago should values be retrieved from?
> >>>  > > > - units - The units of 'durationAgo'.
> >>>  > > > - groups_list - Optional, must correspond to the 'groupBy' list
> >>>  > > >   used in profile creation - List (in square brackets) of groupBy
> >>>  > > >   values used to filter the profile. Default is the empty list,
> >>>  > > >   meaning groupBy was not used when creating the profile.
> >>>  > > > - config_overrides - Optional - Map (in curly braces) of
> >>>  > > >   name:value pairs, each overriding the global config parameter
> >>>  > > >   of the same name. Default is the empty Map, meaning no
> >>>  > > >   overrides.
> >>>  > > >
> >>>  > > > This has the advantage of providing a relatively simple mechanism
> >>>  > > > to support the dominant use-case, gathering the profiles for a
> >>>  > > > trailing window. There are, however, a couple of issues:
> >>>  > > >
> >>>  > > > - We may need more complex semantics for specifying the window
> >>>  > > >   (motivated below)
> >>>  > > > - As such, this couples the gathering of the profiles with the
> >>>  > > >   specification of the window.
> >>>  > > >
> >>>  > > > I propose to decouple these two concepts. I propose that we
> >>>  > > > extract the notion of the lookback into a separate, more
> >>>  > > > featureful function called PROFILE_LOOKBACK() which could be
> >>>  > > > composed with an adjusted PROFILE_GET, whose arguments look like:
> >>>  > > >
> >>>  > > >
> >>>  > > > - profile - The name of the profile.
> >>>  > > > - entity - The name of the entity.
> >>>  > > > - timestamps - The list of timestamps to retrieve
> >>>  > > > - groups_list - Optional, must correspond to the 'groupBy' list
> >>>  > > >   used in profile creation - List (in square brackets) of groupBy
> >>>  > > >   values used to filter the profile. Default is the empty list,
> >>>  > > >   meaning groupBy was not used when creating the profile.
> >>>  > > > - config_overrides - Optional - Map (in curly braces) of
> >>>  > > >   name:value pairs, each overriding the global config parameter
> >>>  > > >   of the same name. Default is the empty Map, meaning no
> >>>  > > >   overrides.
> >>>  > > >
> >>>  > > > So, PROFILE_GET would have the output of PROFILE_LOOKBACK passed
> >>>  > > > to it as its 3rd argument (e.g. PROFILE_GET( 'my_profile',
> >>>  > > > 'my_entity', PROFILE_LOOKBACK(...)) ).
> >>>  > > >
> >>>  > > > *Motivation for Change*
> >>>  > > >
> >>>  > > > The justification for this is that sometimes you want to compare
> >>>  > > > time bins for a long duration back, but you don't want to skew
> >>>  > > > the data by including periods that aren't distributionally
> >>>  > > > similar (due to seasonal data, for instance). You might want to
> >>>  > > > compare a value to a statistical baseline of the median of the
> >>>  > > > values for the same time window on the same day for the last
> >>>  > > > month (e.g. every tuesday at this time).
> >>>  > > >
> >>>  > > > Also, we might want a trailing window that does not start at the
> >>>  > > > current time (in wall-clock), but rather starts an hour back or
> >>>  > > > from the time that the data was originally ingested.
> >>>  > > >
> >>>  > > >
> >>>  > > > *PROFILE_LOOKBACK*
> >>>  > > >
> >>>  > > > I propose that we support the following features:
> >>>  > > >
> >>>  > > > - A starting point that is not current time
> >>>  > > > - Sparse bins (i.e. the last hour for every tuesday for the last
> >>>  > > >   month)
> >>>  > > > - The ability to skip events (e.g. weekends, holidays)
> >>>  > > >
> >>>  > > >
> >>>  > > > This would result in a new function with the following arguments:
> >>>  > > >
> >>>  > > > - from - The lookback starting point (default to now)
> >>>  > > > - fromUnits - The units for the lookback starting point
> >>>  > > > - to - The ending point for the lookback window (default to
> >>>  > > >   from + binSize)
> >>>  > > > - toUnits - The units for the lookback ending point
> >>>  > > > - including - A list of conditions which we would include.
> >>>  > > >   - weekend
> >>>  > > >   - holiday
> >>>  > > >   - sunday through saturday
> >>>  > > > - excluding - A list of conditions which we would skip.
> >>>  > > >   - weekend
> >>>  > > >   - holiday
> >>>  > > >   - sunday through saturday
> >>>  > > > - binSize - The size of the lookback bin
> >>>  > > > - binUnits - The units of the lookback bin
> >>>  > > >
> >>>  > > > Given the number of arguments and their complexity and the fact
> >>>  > > > that many, many are optional, I propose that either
> >>>  > > >
> >>>  > > > - PROFILE_LOOKBACK take a Map so that we can get essentially
> >>>  > > >   named params in stellar.
> >>>  > > > - PROFILE_LOOKBACK accept a string backed by a DSL to express
> >>>  > > >   these criteria
> >>>  > > >
> >>>  > > >
> >>>  > > > Ok, so that's a lot to take in. How about we look at some
> >>>  > > > motivating use-cases.
> >>>  > > >
> >>>  > > > *Base Case: A lookback of 1 hour ago*
> >>>  > > >
> >>>  > > > As a map, this would look like:
> >>>  > > >
> >>>  > > > PROFILE_LOOKBACK( { 'binSize' : 1, 'binUnits' : 'HOURS' } )
> >>>  > > >
> >>>  > > > As a DSL this would look like:
> >>>  > > > PROFILE_LOOKBACK( '1 hour bins from now')
> >>>  > > >
> >>>  > > >
> >>>  > > > *The same time window every tuesday for the last month starting
> >>>  > > > one hour ago*
> >>>  > > >
> >>>  > > > Just to make this as clear as possible, if this is run at 3PM on
> >>>  > > > Monday January 23rd, 2017, it would include the following bins:
> >>>  > > >
> >>>  > > > - January 17th, 2PM - 3PM
> >>>  > > > - January 10th, 2PM - 3PM
> >>>  > > > - January 3rd, 2PM - 3PM
> >>>  > > > - December 27th, 2PM - 3PM
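
The bin selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the proposed Stellar implementation: the function name `sparse_bins` and its parameters are hypothetical, and "the last month" is approximated as 31 days.

```python
from datetime import datetime, timedelta

TUESDAY = 1  # datetime.weekday(): Monday is 0, Tuesday is 1


def sparse_bins(now, start_offset, lookback, bin_size, weekday):
    """Return (start, end) bins on the given weekday, newest first.

    Each bin opens at the wall-clock time of (now - start_offset) and
    lasts bin_size. Days older than (now - lookback) are ignored.
    Hypothetical helper; names and semantics are assumptions.
    """
    anchor = now - start_offset      # e.g. 2PM when run at 3PM
    oldest = now - lookback
    bins = []
    day = anchor
    while day >= oldest:
        if day.weekday() == weekday:
            bins.append((day, day + bin_size))
        day -= timedelta(days=1)     # step back one calendar day
    return bins


run_at = datetime(2017, 1, 23, 15, 0)  # 3PM on Monday, January 23rd, 2017
bins = sparse_bins(
    run_at,
    start_offset=timedelta(hours=1),   # 'from' : 1, 'fromUnits' : 'HOURS'
    lookback=timedelta(days=31),       # 'to' : 1, 'toUnits' : 'MONTH' (approximated)
    bin_size=timedelta(hours=1),       # 'binSize' : 1, 'binUnits' : 'HOURS'
    weekday=TUESDAY,                   # 'including' : [ 'tuesday' ]
)
for start, end in bins:
    print(start.isoformat(), "to", end.isoformat())
```

Run for the date in the example, this yields the four Tuesday bins listed above (January 17th, 10th, 3rd and December 27th, each from 2PM to 3PM). Skipping holidays would need an extra calendar lookup that is not shown here.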
> >>>  > > >
> >>>  > > > As a map, this would look like:
> >>>  > > >
> >>>  > > > PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1,
> >>>  > > > 'toUnits' : 'MONTH', 'including' : [ 'tuesday' ], 'binSize' : 1,
> >>>  > > > 'binUnits' : 'HOURS' } )
> >>>  > > >
> >>>  > > > As a DSL this would look like:
> >>>  > > > PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including
> >>>  > > > tuesdays')
> >>>  > > >
> >>>  > > > *The same time window every sunday for the last month starting
> >>>  > > > one hour ago skipping holidays*
> >>>  > > >
> >>>  > > > Just to make this as clear as possible, if this is run at 3PM on
> >>>  > > > Monday January 22nd, 2017, it would include the following bins:
> >>>  > > >
> >>>  > > > - January 16th, 2PM - 3PM
> >>>  > > > - January 9th, 2PM - 3PM
> >>>  > > > - January 2nd, 2PM - 3PM
> >>>  > > > - NOT December 25th
> >>>  > > >
> >>>  > > > As a map, this would look like:
> >>>  > > >
> >>>  > > > PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1,
> >>>  > > > 'toUnits' : 'MONTH', 'including' : [ 'tuesday'], 'excluding' : [
> >>>  > > > 'holidays' ], 'binSize' : 1, 'binUnits' : 'HOURS' } )
> >>>  > > >
> >>>  > > > As a DSL this would look like:
> >>>  > > > PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including
> >>>  > > > tuesdays excluding holidays')
> >>>  > > >
> >>>  > > > *DSL vs API*
> >>>  > > >
> >>>  > > > So, here's my personal rundown of the two approaches:
> >>>  > > >
> >>>  > > > DSL:
> >>>  > > >
> >>>  > > > - PRO
> >>>  > > > - Clear. As you can see, it reads like a sentence
> >>>  > > > - Concise
> >>>  > > > - CON:
> >>>  > > > - More complex to implement
> >>>  > > > - Another DSL to learn
> >>>  > > >
> >>>  > > > API:
> >>>  > > >
> >>>  > > > - PRO
> >>>  > > > - Simpler to implement (though marginally so, IMO)
> >>>  > > > - CON
> >>>  > > > - A bit more complex to understand (also, IMO)
> >>>  > > >
> >>>  > > > I'd like to solicit feedback from the community at this point:
> >>>  > > >
> >>>  > > > - What do you think of this change?
> >>>  > > > - Would you prefer the DSL, API or other approach?
> >>>  > > >
> >>>  > > > Thanks,
> >>>  > > >
> >>>  > > > Casey
> >>>  > > >
> >>>  > >
> >>>  > >
> >>>  > >
> >>>  > > --
> >>>  > > Nick Allen <nick@nickallen.org>
> >>>  > >
> >>>  >
> >>>
> >>>  --
> >>>  Nick Allen <nick@nickallen.org>
>
> -------------------
> Thank you,
>
> James Sirota
> PPMC- Apache Metron (Incubating)
> jsirota AT apache DOT org
>



-- 
Nick Allen <nick@nickallen.org>
