metron-dev mailing list archives

From James Sirota <jsir...@apache.org>
Subject Re: [DISCUSS] Expansion of the capabilities of PROFILE_GET
Date Tue, 31 Jan 2017 19:07:57 GMT
I am not sure we need to incorporate cron.  Whatever scheduler we use to trigger the profile_get/lookup
should already have cron functionality built in, so I think that's out of scope for us.  I think what
you defined in your original post at the start of this thread is exactly what we need.

31.01.2017, 12:02, "Casey Stella" <cestella@gmail.com>:
> One more point: one of the reasons for decoupling PROFILE_GET from
> PROFILE_LOOKUP is that we could have alternative implementations of
> PROFILE_LOOKUP. We could have a PROFILE_LOOKUP_CRON as well.
>
> On Tue, Jan 31, 2017 at 1:43 PM, Casey Stella <cestella@gmail.com> wrote:
>
>>  Regarding the "?" syntax:
>>  Wouldn't that be forking cron syntax so now we have a metron cron? If
>>  we're constructing our own syntax, then why not do it so that it reads like
>>  natural language?
>>
>>  Regarding the holiday problem:
>>  Agreed, it's a smaller problem than constructing a DSL, but that's not
>>  really the point, I think. The concern is that it would be unable to be
>>  expressed using cron syntax in a natural way without modifying cron syntax,
>>  which would be constructing a new DSL. If quartz has a clever way of doing
>>  that, then I'd like to see it. From a quick search, I haven't seen a
>>  scheduling example with a compact syntax that shows skipping holidays with
>>  cron syntax.
>>
>>  On Tue, Jan 31, 2017 at 1:29 PM, Nick Allen <nick@nickallen.org> wrote:
>>
>>>  >
>>>  > - Cron syntax allows you to construct only absolute lookbacks (i.e.
>>>  > "every tuesday at 3PM" not "every tuesday at the current hour")
>>>
>>>  I think Cron would work for this. I am no expert on cron expressions, but
>>>  I think the following examples would work.
>>>
>>>     - If you want "every Tuesday at 3 PM"
>>>        - 0 0 15 ? * TUE *
>>>     - If you want "every Tuesday at current hour" then use something like
>>>     the "?" placeholder maybe.
>>>        - 0 0 ? ? * TUE *
>>>
>>>  > - Cron syntax allows you to specify a point in time, not a duration. We
>>>  > could, of course, specify a duration as another argument
>>>
>>>  Yes, a separate argument would be necessary. We would have to allow the
>>>  user to specify either a "start from date/time" or the "number of intervals
>>>  to look back".
>>>
>>>  > - Cron syntax does not allow you to skip things like holidays, etc.
>>>
>>>  I agree, out-of-the-box Cron does not solve holiday calendars. But this
>>>  would be a smaller problem to solve than creating our own DSL.
>>>
>>>  There is a tradition of creating shortcuts that look something like @Daily
>>>  or @Weekdays or @Tuesdays that we could also use to make things easier for
>>>  users.
>>>
>>>  I have used Quartz with cron expressions in the past and there was some
>>>  way to handle holidays with that. I think you could create a custom calendar
>>>  for the holidays and call it something; aka @USHolidays. And then you
>>>  would say "every Tuesday" except @USHolidays or something like that. I'd
>>>  have to look into this some more.
>>>
>>>  And there are also nice online Cron expression "translators" that we could
>>>  mimic in a Metron user interface. For example, https://crontab.guru.
>>>
>>>  On Tue, Jan 31, 2017 at 12:00 PM, Casey Stella <cestella@gmail.com> wrote:
>>>
>>>  > I actually did consider cron initially but dismissed it for the following
>>>  > reasons:
>>>  >
>>>  > - Cron syntax allows you to construct only absolute lookbacks (i.e.
>>>  > "every tuesday at 3PM" not "every tuesday at the current hour")
>>>  > - Cron syntax allows you to specify a point in time, not a duration. We
>>>  > could, of course, specify a duration as another argument
>>>  > - Cron syntax does not allow you to skip things like holidays, etc.
>>>  >
>>>  > You could use Cron syntax as part of a broader API to specify the days to
>>>  > look back and have other arguments handle the aspects that cron doesn't
>>>  > support out of the box. I share your concern at making another DSL, but
>>>  > cron seemed to not be a complete solution and its syntax, despite being
>>>  > well known by admins, may not be well known to analysts. Also, and this is
>>>  > just a personal bias, I find it inscrutable without a fair amount of
>>>  > wikipedia and man page reading.
>>>  >
>>>  > On Tue, Jan 31, 2017 at 11:47 AM, Nick Allen <nick@nickallen.org> wrote:
>>>  >
>>>  > > I do prefer the flexibility of the DSL, but would prefer not to create yet
>>>  > > another DSL for our users to learn. Couldn't we somehow use cron
>>>  > > expressions for this functionality?
>>>  > >
>>>  > > On Mon, Jan 23, 2017 at 3:01 PM, Casey Stella <cestella@gmail.com> wrote:
>>>  > >
>>>  > > > Hi All,
>>>  > > >
>>>  > > > I'm planning to expand the capabilities of PROFILE_GET and wanted to
>>>  > > > run an idea past the community.
>>>  > > >
>>>  > > > *Current State*
>>>  > > >
>>>  > > > Currently, the functionality of PROFILE_GET is fairly straightforward:
>>>  > > >
>>>  > > > - profile - The name of the profile.
>>>  > > > - entity - The name of the entity.
>>>  > > > - durationAgo - How long ago should values be retrieved from?
>>>  > > > - units - The units of 'durationAgo'.
>>>  > > > - groups_list - Optional, must correspond to the 'groupBy' list used in
>>>  > > > profile creation - List (in square brackets) of groupBy values used to
>>>  > > > filter the profile. Default is the empty list, meaning groupBy was not
>>>  > > > used when creating the profile.
>>>  > > > - config_overrides - Optional - Map (in curly braces) of name:value
>>>  > > > pairs, each overriding the global config parameter of the same name.
>>>  > > > Default is the empty Map, meaning no overrides.
>>>  > > >
>>>  > > > This has the advantage of providing a relatively simple mechanism to
>>>  > > > support the dominant use-case, gathering the profiles for a trailing
>>>  > > > window. There are, however, a couple of issues:
>>>  > > >
>>>  > > > - We may need more complex semantics for specifying the window
>>>  > > > (motivated below)
>>>  > > > - As such, this couples the gathering of the profiles with the
>>>  > > > specification of the window.
>>>  > > >
>>>  > > > I propose to decouple these two concepts. I propose that we extract the
>>>  > > > notion of the lookback into a separate, more featureful function called
>>>  > > > PROFILE_LOOKBACK() which could be composed with an adjusted PROFILE_GET,
>>>  > > > whose arguments look like:
>>>  > > >
>>>  > > >
>>>  > > > - profile - The name of the profile.
>>>  > > > - entity - The name of the entity.
>>>  > > > - timestamps - The list of timestamps to retrieve
>>>  > > > - groups_list - Optional, must correspond to the 'groupBy' list used in
>>>  > > > profile creation - List (in square brackets) of groupBy values used to
>>>  > > > filter the profile. Default is the empty list, meaning groupBy was not
>>>  > > > used when creating the profile.
>>>  > > > - config_overrides - Optional - Map (in curly braces) of name:value
>>>  > > > pairs, each overriding the global config parameter of the same name.
>>>  > > > Default is the empty Map, meaning no overrides.
>>>  > > >
>>>  > > > So, PROFILE_GET would have the output of PROFILE_LOOKBACK passed to it as
>>>  > > > its 3rd argument (e.g. PROFILE_GET( 'my_profile', 'my_entity',
>>>  > > > PROFILE_LOOKBACK(...)) ).
>>>  > > >
>>>  > > > *Motivation for Change*
>>>  > > >
>>>  > > > The justification for this is that sometimes you want to compare time bins
>>>  > > > for a long duration back, but you don't want to skew the data by including
>>>  > > > periods that aren't distributionally similar (due to seasonal data, for
>>>  > > > instance). You might want to compare a value to a statistical baseline of
>>>  > > > the median of the values for the same time window on the same day for the
>>>  > > > last month (e.g. every tuesday at this time).
>>>  > > >
>>>  > > > Also, we might want a trailing window that does not start at the current
>>>  > > > time (in wall-clock), but rather starts an hour back or from the time that
>>>  > > > the data was originally ingested.
>>>  > > >
>>>  > > >
>>>  > > > *PROFILE_LOOKBACK*
>>>  > > >
>>>  > > > I propose that we support the following features:
>>>  > > >
>>>  > > > - A starting point that is not current time
>>>  > > > - Sparse bins (i.e. the last hour for every tuesday for the last month)
>>>  > > > - The ability to skip events (e.g. weekends, holidays)
>>>  > > >
>>>  > > >
>>>  > > > This would result in a new function with the following arguments:
>>>  > > >
>>>  > > > - from - The lookback starting point (default to now)
>>>  > > > - fromUnits - The units for the lookback starting point
>>>  > > > - to - The ending point for the lookback window (default to from +
>>>  > > > binSize)
>>>  > > > - toUnits - The units for the lookback ending point
>>>  > > > - including - A list of conditions which we would include.
>>>  > > >   - weekend
>>>  > > >   - holiday
>>>  > > >   - sunday through saturday
>>>  > > > - excluding - A list of conditions which we would skip.
>>>  > > >   - weekend
>>>  > > >   - holiday
>>>  > > >   - sunday through saturday
>>>  > > > - binSize - The size of the lookback bin
>>>  > > > - binUnits - The units of the lookback bin
>>>  > > >
>>>  > > > Given the number of arguments and their complexity and the fact that
>>>  > > > many, many are optional, I propose that either
>>>  > > >
>>>  > > > - PROFILE_LOOKBACK take a Map so that we can get essentially named
>>>  > > > params in stellar.
>>>  > > > - PROFILE_LOOKBACK accept a string backed by a DSL to express these
>>>  > > > criteria
>>>  > > >
>>>  > > >
>>>  > > > Ok, so that's a lot to take in. How about we look at some motivating
>>>  > > > use-cases.
>>>  > > >
>>>  > > > *Base Case: A lookback of 1 hour ago*
>>>  > > >
>>>  > > > As a map, this would look like:
>>>  > > >
>>>  > > > PROFILE_LOOKBACK( { 'binSize' : 1, 'binUnits' : 'HOURS' } )
>>>  > > >
>>>  > > > As a DSL this would look like:
>>>  > > > PROFILE_LOOKBACK( '1 hour bins from now')
>>>  > > >
>>>  > > >
>>>  > > > *The same time window every tuesday for the last month starting one hour
>>>  > > > ago*
>>>  > > >
>>>  > > > Just to make this as clear as possible, if this is run at 3PM on Monday
>>>  > > > January 23rd, 2017, it would include the following bins:
>>>  > > >
>>>  > > > - January 17th, 2PM - 3PM
>>>  > > > - January 10th, 2PM - 3PM
>>>  > > > - January 3rd, 2PM - 3PM
>>>  > > > - December 27th, 2PM - 3PM
>>>  > > >
>>>  > > > As a map, this would look like:
>>>  > > >
>>>  > > > PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1,
>>>  > > > 'toUnits' : 'MONTH', 'including' : [ 'tuesday' ], 'binSize' : 1,
>>>  > > > 'binUnits' : 'HOURS' } )
>>>  > > >
>>>  > > > As a DSL this would look like:
>>>  > > > PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including tuesdays')
>>>  > > >
>>>  > > > *The same time window every sunday for the last month starting one hour
>>>  > > > ago skipping holidays*
>>>  > > >
>>>  > > > Just to make this as clear as possible, if this is run at 3PM on Monday
>>>  > > > January 22nd, 2017, it would include the following bins:
>>>  > > >
>>>  > > > - January 16th, 2PM - 3PM
>>>  > > > - January 9th, 2PM - 3PM
>>>  > > > - January 2nd, 2PM - 3PM
>>>  > > > - NOT December 25th
>>>  > > >
>>>  > > > As a map, this would look like:
>>>  > > >
>>>  > > > PROFILE_LOOKBACK( { 'from' : 1, 'fromUnits' : 'HOURS', 'to' : 1,
>>>  > > > 'toUnits' : 'MONTH', 'including' : [ 'tuesday'], 'excluding' :
>>>  > > > [ 'holidays' ], 'binSize' : 1, 'binUnits' : 'HOURS' } )
>>>  > > >
>>>  > > > As a DSL this would look like:
>>>  > > > PROFILE_LOOKBACK( '1 hour bins from 1 hour to 1 month including tuesdays
>>>  > > > excluding holidays')
>>>  > > >
>>>  > > > *DSL vs API*
>>>  > > >
>>>  > > > So, here's my personal rundown of the two approaches:
>>>  > > >
>>>  > > > DSL:
>>>  > > >
>>>  > > > - PRO
>>>  > > >   - Clear. As you can see, it reads like a sentence
>>>  > > >   - Concise
>>>  > > > - CON
>>>  > > >   - More complex to implement
>>>  > > >   - Another DSL to learn
>>>  > > >
>>>  > > > API:
>>>  > > >
>>>  > > > - PRO
>>>  > > >   - Simpler to implement (though marginally so, IMO)
>>>  > > > - CON
>>>  > > >   - A bit more complex to understand (also, IMO)
>>>  > > >
>>>  > > > I'd like to solicit feedback from the community at this point:
>>>  > > >
>>>  > > > - What do you think of this change?
>>>  > > > - Would you prefer the DSL, API or other approach?
>>>  > > >
>>>  > > > Thanks,
>>>  > > >
>>>  > > > Casey
>>>  > > >
>>>  > >
>>>  > >
>>>  > >
>>>  > > --
>>>  > > Nick Allen <nick@nickallen.org>
>>>  > >
>>>  >
>>>
>>>  --
>>>  Nick Allen <nick@nickallen.org>
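
For reference, the "every tuesday for the last month starting one hour ago"
example from the original proposal quoted above can be worked through with a
small sketch. This is not Metron code: lookback_bins and its parameters are
hypothetical names chosen for illustration, "1 month" is assumed to mean 30
days, and holidays are assumed to arrive as a caller-supplied set of dates.

```python
from datetime import date, datetime, timedelta

# Weekday names indexed by datetime.weekday() (Monday == 0).
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def lookback_bins(now, from_ago, to_ago, bin_size, including=(), excluding=()):
    """Return (start, end) bins, newest first.

    Hypothetical helper, not part of Metron. One candidate bin is tried per
    day, beginning `from_ago` before `now` and going no further back than
    `to_ago` before `now`.
    """
    oldest = now - to_ago
    bins = []
    start = now - from_ago
    while start >= oldest:
        # An empty `including` means every weekday qualifies.
        day_ok = not including or WEEKDAYS[start.weekday()] in including
        if day_ok and start.date() not in excluding:
            bins.append((start, start + bin_size))
        start -= timedelta(days=1)
    return bins


run_at = datetime(2017, 1, 23, 15, 0)  # Monday, 3 PM, as in the proposal

tuesdays = lookback_bins(
    run_at,
    from_ago=timedelta(hours=1),
    to_ago=timedelta(days=30),  # assumption: "1 month" means 30 days
    bin_size=timedelta(hours=1),
    including={"tuesday"},
)

# Same query, skipping a placeholder "holiday" date (purely illustrative).
no_holidays = lookback_bins(
    run_at,
    from_ago=timedelta(hours=1),
    to_ago=timedelta(days=30),
    bin_size=timedelta(hours=1),
    including={"tuesday"},
    excluding={date(2016, 12, 27)},
)

for start, end in tuesdays:
    print(start.strftime("%B %d, %I%p"), "-", end.strftime("%I%p"))
```

Under these assumptions the first call yields the four Tuesday bins listed in
the proposal, and the timestamps in those bins are what an adjusted
PROFILE_GET would receive as its third argument.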

------------------- 
Thank you,

James Sirota
PPMC- Apache Metron (Incubating)
jsirota AT apache DOT org
