metron-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nick Allen <n...@nickallen.org>
Subject Re: [DISCUSS] How to do Sliding Windows in Profiler
Date Wed, 25 Jan 2017 14:28:51 GMT
That was a lot to digest so I apologize if I have missed some of your
thought process.  I probably need to read through this a couple more times.


Can anyone specify a better way to do Sliding Window profiles correctly
> with current functionality?


Could we not change the underlying implementation of the profiler to
support sliding windows?  Ideally the same profile definition could be
applied to either a tumbling or sliding window, rather than requiring a
change in the profile definition itself.

The work that I had done for METRON-590, made it possible to configure
either a tumbling or sliding window.  The downside of that work is that it
was an all-or-nothing change, all profiles (in the same topology) were
either tumbling or sliding.  But perhaps we can enhance it from there.


A much more elegant solution would be to allow “result” to write two
> objects.


Yes, I like this! I have had the same thought myself.  This would be a
simple change and would provide the user with some flexibility.  Does this
change solve the entire problem though?


Obviously, I do need to sit down and think on this a bit more.  I really
wish we could handle these use cases with a profile definition that remains
simple and easy to grok.  Thanks for spending the time to think through
this.

On Wed, Jan 25, 2017 at 2:00 AM, Matt Foley <mattf@apache.org> wrote:

> Hi all,
>
> Casey and I had an interesting chat yesterday, during which we agreed that
> the example code for Outlier Analysis in https://github.com/apache/
> incubator-metron/blob/master/metron-analytics/metron-statistics/README.md
> and the revised example code in https://issues.apache.org/
> jira/browse/METRON-668 (as of 23 January) both do not correctly implement
> the desired Sliding Window model.  This email gives the argument for why,
> and proposes a couple ways to do it right.  Your input and preferences are
> requested.
>
>
>
> First a couple statements about the STATS object that underlies most
> interesting Profile work:
>
> ·         The STATS object is a t-digest.  It summarizes a set of data
> points, such as those received during a sampling period, in a way that is
> nominally O(1) regardless of the input number of data points, and preserves
> the info about the “shape” of the statistical distribution of those
> points.  Not only info about averages and standard deviations, but also
> about medians and percentiles (which, btw, is a very different kind of
> information), is preserved and can be calculated correctly to within a
> given error epsilon.  Since it is a summary, however, time information is
> lost.
>
> ·         STATS objects, these digests of sampling periods, are MERGEABLE,
> meaning if you have a digest from time(1) to time(2), and another digest
> from time(2) to time (3), you can merge them and get a digest that is
> statistically equivalent to a digest from time(1) to time(3) continuously.
>
> ·         They are sort of idempotent, in that if you take two identical
> digests and merge them, you get almost the same object.  However, the
> result object will be scaled as summarizing twice the number of input data
> points.
>
> ·         Which is why it DOESN’T work to try to merge overlapping
> sampling periods.  To give a crude example, if you have a digest from
> time(1) to time(3) and another digest from time(2) to time(4), and merge
> them, the samples from time(2) to time(3) will be over-represented by a
> factor of 2x, which should be expected to skew the distribution (unless the
> distribution really is constant for all sub-windows – which would mean we
> don’t need Sliding Windows because nothing changes).
>
>
>
> The Outlier Analysis profiles linked above try to implement a sliding
> window, in which each profile period summarizes the Median Absolute
> Deviation distribution of the last five profile periods only.  An “Outlier
> MAD Score” can then be determined by comparing the deviation of a new data
> point to the last MAD distribution recorded in the Profile.  This allows
> for changes over time in the statistical distribution of inputs, but does
> not make the measure unduly sensitive to just the last minute or two.  This
> is a typical use case for Sliding Windows.
>
>
>
> Both example codes trip on how to do the sliding window in the context of
> a Profile.  At sampling period boundaries, both either compose the “result”
> digest or initialize the “next” digest by reading the previous 5 result
> digests.  That is wrong, because it ignores the fact that those digests
> aren’t just for their time periods.  They too were composed with THEIR
> preceding 5 digests, each of which were composed with their preceding 5
> digests, which in turn… etc.  The end result is sort of like the way
> Madieras or some brandies are aged via the Solera process with fractional
> blending.  You don’t get a true sliding window, which sharply drops the
> past, you get a continuous dilution of the past. In fact, it’s wrong to
> assume that the “tail” of the far past is more diluted than the near past!
> – It would be with some algorithms, but the alg used in these two examples
> causes the far past to become an exponentially MORE important fraction of
> the overall data than the near past – much worse than simply turning on
> digesting at time(0) and leaving it on, with no attempt at windowing.
> (Simulate it in a spreadsheet and you’ll see.)
>
>
>
> We need a Profiler structure that assists in creating Sliding Window
> profiles.  The problem is that Profiles let you save only one thing
> (quantity or object) per sampling period, and that’s typically a different
> “thing” (object type or scale) than you want to use to compose the result
> for each windowed span.  One way to do it correctly would be with two
> Profiles, like this:
>
>
>
> (SOLUTION A)
>
> {
>
>   "profiles": [
>
>     {
>
>       "profile": "sketchy_mad",
>
>       "foreach": "'global'",
>
>       "init" : {
>
>         "s": "OUTLIER_MAD_INIT()"
>
>         },
>
>       "update": {
>
>         "s": "OUTLIER_MAD_ADD(s, value)"
>
>         },
>
>       "result": "s"
>
>     },
>
>     {
>
>       "profile": "windowed_mad",
>
>       "foreach": "'global'",
>
>       "init" : { },
>
>       "update": { },
>
>       "result": "OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad',
>
> 'global', 5, 'MINUTES'))"
>
>     }
>
>   ]
>
> }
>
>
>
> This is typical.  You have a fine-grain sampling period that you want to
> “tumble”, and a broader window that you want to “slide” or “roll” along the
> underlying smaller sampling periods.
>
> The above pair of profiles is clear enough, but not real efficient.  Also,
> you can’t guarantee they’ll stay exactly synchronized.  It would be better
> to combine them somehow.
>
>
>
> In these examples, we only need to look back 5 sampling periods, so the
> following would work (NOTE: requires Casey’s PR#668 that preserves state
> between sampling periods):
>
>
>
> (SOLUTION B)
>
> {
>
>   "profiles": [
>
>     {
>
>       "profile": "windowed_mad",
>
>       "foreach": "'global'",
>
>       "init" : {
>
>         "s4": "if exists(s3) then s3 else OUTLIER_MAD_INIT()",
>
>         "s3": "if exists(s2) then s2 else OUTLIER_MAD_INIT()",
>
>         "s2": "if exists(s1) then s1 else OUTLIER_MAD_INIT()",
>
>        "s1": "if exists(s0) then s0 else OUTLIER_MAD_INIT()",
>
>         "s0": "OUTLIER_MAD_INIT()"
>
>         },
>
>       "update": {
>
>         "s0": "OUTLIER_MAD_ADD(s0, value)"
>
>         },
>
>       "result": "OUTLIER_MAD_STATE_MERGE(s0, s1, s2, s3, s4)”
>
>     }
>
>   ]
>
> }
>
>
>
> This works fine for a sliding window of size 5 sample periods.  But what
> if it was 15?  Or 60?
>
> No doubt we could implement a ring buffer with a stellar list object, but
> it’s gonna get ugly(er) and slow.
>
>
>
> A much more elegant solution would be to allow “result” to write two
> objects.  The result could be viewed either as two Profiles, or as a
> Profile with multiple addressable columns of hbase output.  For instance:
>
> (SOLUTION C)
>
> {
>
>   "profiles": [
>
>     {
>
>       "profile": "[ 'sketchy_mad', 'windowed_mad' ]",
>
>       "foreach": "'global'",
>
>       "init" : {
>
>         "s": "OUTLIER_MAD_INIT()"
>
>         },
>
>       "update": {
>
>         "s": "OUTLIER_MAD_ADD(s, value)"
>
>         },
>
>       "result": {
>
>         "sketchy_mad": "s",
>
>         "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s,
> PROFILE_GET('sketchy_mad',
>
> 'global', 4, 'MINUTES'))"
>
>         }
>
>     }
>
>   ]
>
> }
>
>
>
> This is clean and tight, and I think will be easily generalized to any
> Sliding Window scenario.
>
> BTW, given that each write has to work it’s way through more processing
> before getting to HBase, it’s best to say that all writes happen AFTER all
> the “result” processing, which is why I showed the MERGE above retrieving
> only 4 sampling periods prior to the current one, and then add the current
> one from local state.
>
>
>
> What do you think?  Can anyone specify a better way to do Sliding Window
> profiles correctly with current functionality?
>
>
>
>
>
> Casey noted that if this feature were available, he would like to consider
> separating the value distribution from the deviation distribution, in the
> OUTLIER_MAD state objects.  He also notes that the hypothetical
> OUTLIER_MAD_INIT() function needs to take a starting median, as argument.
> So the final Sliding Window Profile for OUTLIER_MAD state tracking might
> look more like:
>
>
>
> {
>
>   "profiles": [
>
>     {
>
>       "profile": "[ 'sketchy_values', 'sketchy_mad', 'windowed_mad' ]",
>
>       "foreach": "'global'",
>
>       "init" : {
>
>         "median": "STATS_PERCENTILE(STATS_MERGE(PROFILE_GET('sketchy_values',
> 'global', 5, 'MINUTES')), 50)",
>
>         "s": "OUTLIER_MAD_INIT(median)",
>
>         "val_dist": "STATS_INIT()"
>
>         },
>
>       "update": {
>
>         "val_dist": "STATS_ADD(val_dist, value)",
>
>         "s": "OUTLIER_MAD_ADD(s, value)"
>
>         },
>
>       "result": {
>
>         "sketchy_values": "val_dist",
>
>         "sketchy_mad": "s",
>
>         "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s,
> PROFILE_GET('sketchy_mad',
>
> 'global', 4, 'MINUTES'))"
>
>         }
>
>     },
>
>   ]
>
> }
>
>
>
> Thanks,
>
> --Matt
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>


-- 
Nick Allen <nick@nickallen.org>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message