metron-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matt Foley <ma...@apache.org>
Subject [DISCUSS] How to do Sliding Windows in Profiler
Date Wed, 25 Jan 2017 07:00:08 GMT
Hi all,

Casey and I had an interesting chat yesterday, during which we agreed that the example code
for Outlier Analysis in https://github.com/apache/incubator-metron/blob/master/metron-analytics/metron-statistics/README.md
and the revised example code in https://issues.apache.org/jira/browse/METRON-668 (as of 23
January) both do not correctly implement the desired Sliding Window model.  This email gives
the argument for why, and proposes a couple ways to do it right.  Your input and preferences
are requested.

 

First a couple statements about the STATS object that underlies most interesting Profile work:

·         The STATS object is a t-digest.  It summarizes a set of data points, such as those
received during a sampling period, in a way that is nominally O(1) regardless of the input
number of data points, and preserves the info about the “shape” of the statistical distribution
of those points.  Not only info about averages and standard deviations, but also about medians
and percentiles (which, btw, is a very different kind of information), is preserved and can
be calculated correctly to within a given error epsilon.  Since it is a summary, however,
time information is lost.

·         STATS objects, these digests of sampling periods, are MERGEABLE, meaning if you
have a digest from time(1) to time(2), and another digest from time(2) to time (3), you can
merge them and get a digest that is statistically equivalent to a digest from time(1) to time(3)
continuously.

·         They are sort of idempotent, in that if you take two identical digests and merge
them, you get almost the same object.  However, the result object will be scaled as summarizing
twice the number of input data points.

·         Which is why it DOESN’T work to try to merge overlapping sampling periods.  To
give a crude example, if you have a digest from time(1) to time(3) and another digest from
time(2) to time(4), and merge them, the samples from time(2) to time(3) will be over-represented
by a factor of 2x, which should be expected to skew the distribution (unless the distribution
really is constant for all sub-windows – which would mean we don’t need Sliding Windows
because nothing changes).

 

The Outlier Analysis profiles linked above try to implement a sliding window, in which each
profile period summarizes the Median Absolute Deviation distribution of the last five profile
periods only.  An “Outlier MAD Score” can then be determined by comparing the deviation
of a new data point to the last MAD distribution recorded in the Profile.  This allows for
changes over time in the statistical distribution of inputs, but does not make the measure
unduly sensitive to just the last minute or two.  This is a typical use case for Sliding Windows.

 

Both example codes trip on how to do the sliding window in the context of a Profile.  At sampling
period boundaries, both either compose the “result” digest or initialize the “next”
digest by reading the previous 5 result digests.  That is wrong, because it ignores the fact
that those digests aren’t just for their time periods.  They too were composed with THEIR
preceding 5 digests, each of which were composed with their preceding 5 digests, which in
turn… etc.  The end result is sort of like the way Madieras or some brandies are aged via
the Solera process with fractional blending.  You don’t get a true sliding window, which
sharply drops the past, you get a continuous dilution of the past. In fact, it’s wrong to
assume that the “tail” of the far past is more diluted than the near past! – It would
be with some algorithms, but the alg used in these two examples causes the far past to become
an exponentially MORE important fraction of the overall data than the near past – much worse
than simply turning on digesting at time(0) and leaving it on, with no attempt at windowing.
 (Simulate it in a spreadsheet and you’ll see.)

 

We need a Profiler structure that assists in creating Sliding Window profiles.  The problem
is that Profiles let you save only one thing (quantity or object) per sampling period, and
that’s typically a different “thing” (object type or scale) than you want to use to
compose the result for each windowed span.  One way to do it correctly would be with two Profiles,
like this:

 

(SOLUTION A)

{

  "profiles": [

    {

      "profile": "sketchy_mad",

      "foreach": "'global'",

      "init" : {

        "s": "OUTLIER_MAD_INIT()"

        },

      "update": {

        "s": "OUTLIER_MAD_ADD(s, value)"

        },

      "result": "s"

    },

    {

      "profile": "windowed_mad",

      "foreach": "'global'",

      "init" : { },

      "update": { },

      "result": "OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad',

'global', 5, 'MINUTES'))"

    }

  ]

}

 

This is typical.  You have a fine-grain sampling period that you want to “tumble”, and
a broader window that you want to “slide” or “roll” along the underlying smaller sampling
periods.

The above pair of profiles is clear enough, but not real efficient.  Also, you can’t guarantee
they’ll stay exactly synchronized.  It would be better to combine them somehow.

 

In these examples, we only need to look back 5 sampling periods, so the following would work
(NOTE: requires Casey’s PR#668 that preserves state between sampling periods): 

 

(SOLUTION B)

{

  "profiles": [

    {

      "profile": "windowed_mad",

      "foreach": "'global'",

      "init" : {

        "s4": "if exists(s3) then s3 else OUTLIER_MAD_INIT()",

        "s3": "if exists(s2) then s2 else OUTLIER_MAD_INIT()",

        "s2": "if exists(s1) then s1 else OUTLIER_MAD_INIT()",

       "s1": "if exists(s0) then s0 else OUTLIER_MAD_INIT()",

        "s0": "OUTLIER_MAD_INIT()"

        },

      "update": {

        "s0": "OUTLIER_MAD_ADD(s0, value)"

        },

      "result": "OUTLIER_MAD_STATE_MERGE(s0, s1, s2, s3, s4)”

    }

  ]

}

 

This works fine for a sliding window of size 5 sample periods.  But what if it was 15?  Or
60?

No doubt we could implement a ring buffer with a stellar list object, but it’s gonna get
ugly(er) and slow.

 

A much more elegant solution would be to allow “result” to write two objects.  The result
could be viewed either as two Profiles, or as a Profile with multiple addressable columns
of hbase output.  For instance:

(SOLUTION C)

{

  "profiles": [

    {

      "profile": "[ 'sketchy_mad', 'windowed_mad' ]",

      "foreach": "'global'",

      "init" : {

        "s": "OUTLIER_MAD_INIT()"

        },

      "update": {

        "s": "OUTLIER_MAD_ADD(s, value)"

        },

      "result": {

        "sketchy_mad": "s",

        "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s, PROFILE_GET('sketchy_mad',

'global', 4, 'MINUTES'))"

        }

    }

  ]

}

 

This is clean and tight, and I think will be easily generalized to any Sliding Window scenario.

BTW, given that each write has to work it’s way through more processing before getting to
HBase, it’s best to say that all writes happen AFTER all the “result” processing, which
is why I showed the MERGE above retrieving only 4 sampling periods prior to the current one,
and then add the current one from local state.

 

What do you think?  Can anyone specify a better way to do Sliding Window profiles correctly
with current functionality?

 

 

Casey noted that if this feature were available, he would like to consider separating the
value distribution from the deviation distribution, in the OUTLIER_MAD state objects.  He
also notes that the hypothetical OUTLIER_MAD_INIT() function needs to take a starting median,
as argument.  So the final Sliding Window Profile for OUTLIER_MAD state tracking might look
more like:

 

{

  "profiles": [

    {

      "profile": "[ 'sketchy_values', 'sketchy_mad', 'windowed_mad' ]",

      "foreach": "'global'",

      "init" : {

        "median": "STATS_PERCENTILE(STATS_MERGE(PROFILE_GET('sketchy_values', 'global', 5,
'MINUTES')), 50)",

        "s": "OUTLIER_MAD_INIT(median)",

        "val_dist": "STATS_INIT()"

        },

      "update": {

        "val_dist": "STATS_ADD(val_dist, value)",

        "s": "OUTLIER_MAD_ADD(s, value)"

        },

      "result": {

        "sketchy_values": "val_dist",

        "sketchy_mad": "s",

        "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s, PROFILE_GET('sketchy_mad',

'global', 4, 'MINUTES'))"

        }

    },

  ]

}

 

Thanks,

--Matt

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message