metron-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Miklavcic <michael.miklav...@gmail.com>
Subject Re: [DISCUSS] How to do Sliding Windows in Profiler
Date Wed, 25 Jan 2017 19:06:45 GMT
Hey guys,

1. I'm going to Google that brandy analogy later. It sounds tasty :)
2. The option for reading/writing multiple profiles makes sense to me and
provides much greater flexibility and efficiency.
3. My gut reaction to reading through this is "wow." This seems really
complicated. I'm all for the flexibility provided by these composable
functions and think this would be much better for users with either some
templates or other form of abstraction to simplify common patterns.

Mike


On Wed, Jan 25, 2017 at 8:02 AM, Casey Stella <cestella@gmail.com> wrote:

> So, I think the key takeaway, at least for me, and my key feedback is that:
>
>    - We should separate the wrong behavior of MAD with the ability to
>    improve windowing.  We can fix one without the other and, in my opinion,
>    the key fix for MAD is keeping the value distribution independent of the
>    MAD state, as Matt noted at the end of this email.
>    - Because the above necessitates tracking two things, it'd be nice to
>    adopt Matt's suggestion to be able to sort of merge profiles track two
>    results.
>
> One more thing, in general, if you ever want to merge outputs on read, then
> you should never write out the merged data (unless the data you're writing
> has a way to untangle the overlap).  They are two separate semantics.
>
> On merge on read, you choose the window at read time.  An example of this
> for a hypothetically adjusted MAD to have a OUTLIER_MAD_INIT function which
> takes the value distribution would be:
> {
>   "profiles": [
>     {
>       "profile": "[ 'sketchy_values', 'sketchy_mad' ]",
>       "foreach": "'global'",
>       "init" : {
>         "s": "OUTLIER_MAD_INIT(STATS_MERGE(PROFILE_GET('sketchy_values',
> 'global', 5, 'MINUTES')))",
>         "val_dist": "STATS_INIT()"
>         },
>       "update": {
>         "val_dist": "STATS_ADD(val_dist, value)",
>         "s": "OUTLIER_MAD_ADD(s, value)"
>         },
>       "result": {
>         "sketchy_values": "val_dist",
>         "sketchy_mad": "s",
>         }
>     }
>   ]
> }
>
> You would then read this in the enrichment topology or wherever
> via OUTLIER_MAD_SCORE(OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad',
> 'global', 5, 'MINUTES')), value)
>
>
> On merge on write, you choose the window size on write time and should read
> only the latest profile entry on read, so that would look like:
> {
>   "profiles": [
>     {
>       "profile": "[ 'sketchy_values', 'sketchy_mad', 'windowed_mad' ]",
>       "foreach": "'global'",
>       "init" : {
>         "s": "OUTLIER_MAD_INIT(STATS_MERGE(PROFILE_GET('sketchy_values',
> 'global', 5, 'MINUTES')))",
>         "val_dist": "STATS_INIT()"
>         },
>       "update": {
>         "val_dist": "STATS_ADD(val_dist, value)",
>         "s": "OUTLIER_MAD_ADD(s, value)"
>         },
>       "result": {
>         "sketchy_values": "val_dist",
>         "sketchy_mad": "s",
>         "windowed_mad" : "OUTLIER_MAD_STATE_MERGE(s,
> PROFILE_GET('sketchy_mad', 'global', 4, 'MINUTES'))"
>         }
>     }
>   ]
> }
>
>
> You would then score pulling only the most recent MAD state.  Merging
> across them would double count entries, as Matt mentioned above
> OUTLIER_MAD_SCORE(GET_LAST(PROFILE_GET('sketchy_mad', 'global', 1,
> 'MINUTES')), value)
>
> So, tl;dr, I'm in favor of Matt's syntax to merge profiles and give
> multiple outputs.  Also, different but coupled problem, MAD needs fixing to
> decouple tracking value distributions from the distribution of the
> deviation from the median.
>
> Casey
>
> On Wed, Jan 25, 2017 at 9:28 AM, Nick Allen <nick@nickallen.org> wrote:
>
> > That was a lot to digest so I apologize if I have missed some of your
> > thought process.  I probably need to read through this a couple more
> times.
> >
> >
> > Can anyone specify a better way to do Sliding Window profiles correctly
> > > with current functionality?
> >
> >
> > Could we not change the underlying implementation of the profiler to
> > support sliding windows?  Ideally the same profile definition could be
> > applied to either a tumbling or sliding window, rather than requiring a
> > change in the profile definition itself.
> >
> > The work that I had done for METRON-590, made it possible to configure
> > either a tumbling or sliding window.  The downside of that work is that
> it
> > was an all-or-nothing change, all profiles (in the same topology) were
> > either tumbling or sliding.  But perhaps we can enhance it from there.
> >
> >
> > A much more elegant solution would be to allow “result” to write two
> > > objects.
> >
> >
> > Yes, I like this! I have had the same thought myself.  This would be a
> > simple change and would provide the user with some flexibility.  Does
> this
> > change solve the entire problem though?
> >
> >
> > Obviously, I do need to sit down and think on this a bit more.  I really
> > wish we could handle these use cases with a profile definition that
> remains
> > simple and easy to grok.  Thanks for spending the time to think through
> > this.
> >
> > On Wed, Jan 25, 2017 at 2:00 AM, Matt Foley <mattf@apache.org> wrote:
> >
> > > Hi all,
> > >
> > > Casey and I had an interesting chat yesterday, during which we agreed
> > that
> > > the example code for Outlier Analysis in https://github.com/apache/
> > > incubator-metron/blob/master/metron-analytics/metron-
> > statistics/README.md
> > > and the revised example code in https://issues.apache.org/
> > > jira/browse/METRON-668 (as of 23 January) both do not correctly
> implement
> > > the desired Sliding Window model.  This email gives the argument for
> why,
> > > and proposes a couple ways to do it right.  Your input and preferences
> > are
> > > requested.
> > >
> > >
> > >
> > > First a couple statements about the STATS object that underlies most
> > > interesting Profile work:
> > >
> > > ·         The STATS object is a t-digest.  It summarizes a set of data
> > > points, such as those received during a sampling period, in a way that
> is
> > > nominally O(1) regardless of the input number of data points, and
> > preserves
> > > the info about the “shape” of the statistical distribution of those
> > > points.  Not only info about averages and standard deviations, but also
> > > about medians and percentiles (which, btw, is a very different kind of
> > > information), is preserved and can be calculated correctly to within a
> > > given error epsilon.  Since it is a summary, however, time information
> is
> > > lost.
> > >
> > > ·         STATS objects, these digests of sampling periods, are
> > MERGEABLE,
> > > meaning if you have a digest from time(1) to time(2), and another
> digest
> > > from time(2) to time (3), you can merge them and get a digest that is
> > > statistically equivalent to a digest from time(1) to time(3)
> > continuously.
> > >
> > > ·         They are sort of idempotent, in that if you take two
> identical
> > > digests and merge them, you get almost the same object.  However, the
> > > result object will be scaled as summarizing twice the number of input
> > data
> > > points.
> > >
> > > ·         Which is why it DOESN’T work to try to merge overlapping
> > > sampling periods.  To give a crude example, if you have a digest from
> > > time(1) to time(3) and another digest from time(2) to time(4), and
> merge
> > > them, the samples from time(2) to time(3) will be over-represented by a
> > > factor of 2x, which should be expected to skew the distribution (unless
> > the
> > > distribution really is constant for all sub-windows – which would mean
> we
> > > don’t need Sliding Windows because nothing changes).
> > >
> > >
> > >
> > > The Outlier Analysis profiles linked above try to implement a sliding
> > > window, in which each profile period summarizes the Median Absolute
> > > Deviation distribution of the last five profile periods only.  An
> > “Outlier
> > > MAD Score” can then be determined by comparing the deviation of a new
> > data
> > > point to the last MAD distribution recorded in the Profile.  This
> allows
> > > for changes over time in the statistical distribution of inputs, but
> does
> > > not make the measure unduly sensitive to just the last minute or two.
> > This
> > > is a typical use case for Sliding Windows.
> > >
> > >
> > >
> > > Both example codes trip on how to do the sliding window in the context
> of
> > > a Profile.  At sampling period boundaries, both either compose the
> > “result”
> > > digest or initialize the “next” digest by reading the previous 5 result
> > > digests.  That is wrong, because it ignores the fact that those digests
> > > aren’t just for their time periods.  They too were composed with THEIR
> > > preceding 5 digests, each of which were composed with their preceding 5
> > > digests, which in turn… etc.  The end result is sort of like the way
> > > Madieras or some brandies are aged via the Solera process with
> fractional
> > > blending.  You don’t get a true sliding window, which sharply drops the
> > > past, you get a continuous dilution of the past. In fact, it’s wrong to
> > > assume that the “tail” of the far past is more diluted than the near
> > past!
> > > – It would be with some algorithms, but the alg used in these two
> > examples
> > > causes the far past to become an exponentially MORE important fraction
> of
> > > the overall data than the near past – much worse than simply turning on
> > > digesting at time(0) and leaving it on, with no attempt at windowing.
> > > (Simulate it in a spreadsheet and you’ll see.)
> > >
> > >
> > >
> > > We need a Profiler structure that assists in creating Sliding Window
> > > profiles.  The problem is that Profiles let you save only one thing
> > > (quantity or object) per sampling period, and that’s typically a
> > different
> > > “thing” (object type or scale) than you want to use to compose the
> result
> > > for each windowed span.  One way to do it correctly would be with two
> > > Profiles, like this:
> > >
> > >
> > >
> > > (SOLUTION A)
> > >
> > > {
> > >
> > >   "profiles": [
> > >
> > >     {
> > >
> > >       "profile": "sketchy_mad",
> > >
> > >       "foreach": "'global'",
> > >
> > >       "init" : {
> > >
> > >         "s": "OUTLIER_MAD_INIT()"
> > >
> > >         },
> > >
> > >       "update": {
> > >
> > >         "s": "OUTLIER_MAD_ADD(s, value)"
> > >
> > >         },
> > >
> > >       "result": "s"
> > >
> > >     },
> > >
> > >     {
> > >
> > >       "profile": "windowed_mad",
> > >
> > >       "foreach": "'global'",
> > >
> > >       "init" : { },
> > >
> > >       "update": { },
> > >
> > >       "result": "OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad',
> > >
> > > 'global', 5, 'MINUTES'))"
> > >
> > >     }
> > >
> > >   ]
> > >
> > > }
> > >
> > >
> > >
> > > This is typical.  You have a fine-grain sampling period that you want
> to
> > > “tumble”, and a broader window that you want to “slide” or “roll”
along
> > the
> > > underlying smaller sampling periods.
> > >
> > > The above pair of profiles is clear enough, but not real efficient.
> > Also,
> > > you can’t guarantee they’ll stay exactly synchronized.  It would be
> > better
> > > to combine them somehow.
> > >
> > >
> > >
> > > In these examples, we only need to look back 5 sampling periods, so the
> > > following would work (NOTE: requires Casey’s PR#668 that preserves
> state
> > > between sampling periods):
> > >
> > >
> > >
> > > (SOLUTION B)
> > >
> > > {
> > >
> > >   "profiles": [
> > >
> > >     {
> > >
> > >       "profile": "windowed_mad",
> > >
> > >       "foreach": "'global'",
> > >
> > >       "init" : {
> > >
> > >         "s4": "if exists(s3) then s3 else OUTLIER_MAD_INIT()",
> > >
> > >         "s3": "if exists(s2) then s2 else OUTLIER_MAD_INIT()",
> > >
> > >         "s2": "if exists(s1) then s1 else OUTLIER_MAD_INIT()",
> > >
> > >        "s1": "if exists(s0) then s0 else OUTLIER_MAD_INIT()",
> > >
> > >         "s0": "OUTLIER_MAD_INIT()"
> > >
> > >         },
> > >
> > >       "update": {
> > >
> > >         "s0": "OUTLIER_MAD_ADD(s0, value)"
> > >
> > >         },
> > >
> > >       "result": "OUTLIER_MAD_STATE_MERGE(s0, s1, s2, s3, s4)”
> > >
> > >     }
> > >
> > >   ]
> > >
> > > }
> > >
> > >
> > >
> > > This works fine for a sliding window of size 5 sample periods.  But
> what
> > > if it was 15?  Or 60?
> > >
> > > No doubt we could implement a ring buffer with a stellar list object,
> but
> > > it’s gonna get ugly(er) and slow.
> > >
> > >
> > >
> > > A much more elegant solution would be to allow “result” to write two
> > > objects.  The result could be viewed either as two Profiles, or as a
> > > Profile with multiple addressable columns of hbase output.  For
> instance:
> > >
> > > (SOLUTION C)
> > >
> > > {
> > >
> > >   "profiles": [
> > >
> > >     {
> > >
> > >       "profile": "[ 'sketchy_mad', 'windowed_mad' ]",
> > >
> > >       "foreach": "'global'",
> > >
> > >       "init" : {
> > >
> > >         "s": "OUTLIER_MAD_INIT()"
> > >
> > >         },
> > >
> > >       "update": {
> > >
> > >         "s": "OUTLIER_MAD_ADD(s, value)"
> > >
> > >         },
> > >
> > >       "result": {
> > >
> > >         "sketchy_mad": "s",
> > >
> > >         "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s,
> > > PROFILE_GET('sketchy_mad',
> > >
> > > 'global', 4, 'MINUTES'))"
> > >
> > >         }
> > >
> > >     }
> > >
> > >   ]
> > >
> > > }
> > >
> > >
> > >
> > > This is clean and tight, and I think will be easily generalized to any
> > > Sliding Window scenario.
> > >
> > > BTW, given that each write has to work it’s way through more processing
> > > before getting to HBase, it’s best to say that all writes happen AFTER
> > all
> > > the “result” processing, which is why I showed the MERGE above
> retrieving
> > > only 4 sampling periods prior to the current one, and then add the
> > current
> > > one from local state.
> > >
> > >
> > >
> > > What do you think?  Can anyone specify a better way to do Sliding
> Window
> > > profiles correctly with current functionality?
> > >
> > >
> > >
> > >
> > >
> > > Casey noted that if this feature were available, he would like to
> > consider
> > > separating the value distribution from the deviation distribution, in
> the
> > > OUTLIER_MAD state objects.  He also notes that the hypothetical
> > > OUTLIER_MAD_INIT() function needs to take a starting median, as
> argument.
> > > So the final Sliding Window Profile for OUTLIER_MAD state tracking
> might
> > > look more like:
> > >
> > >
> > >
> > > {
> > >
> > >   "profiles": [
> > >
> > >     {
> > >
> > >       "profile": "[ 'sketchy_values', 'sketchy_mad', 'windowed_mad' ]",
> > >
> > >       "foreach": "'global'",
> > >
> > >       "init" : {
> > >
> > >         "median": "STATS_PERCENTILE(STATS_MERGE(
> > PROFILE_GET('sketchy_values',
> > > 'global', 5, 'MINUTES')), 50)",
> > >
> > >         "s": "OUTLIER_MAD_INIT(median)",
> > >
> > >         "val_dist": "STATS_INIT()"
> > >
> > >         },
> > >
> > >       "update": {
> > >
> > >         "val_dist": "STATS_ADD(val_dist, value)",
> > >
> > >         "s": "OUTLIER_MAD_ADD(s, value)"
> > >
> > >         },
> > >
> > >       "result": {
> > >
> > >         "sketchy_values": "val_dist",
> > >
> > >         "sketchy_mad": "s",
> > >
> > >         "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s,
> > > PROFILE_GET('sketchy_mad',
> > >
> > > 'global', 4, 'MINUTES'))"
> > >
> > >         }
> > >
> > >     },
> > >
> > >   ]
> > >
> > > }
> > >
> > >
> > >
> > > Thanks,
> > >
> > > --Matt
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Nick Allen <nick@nickallen.org>
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message