metron-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Casey Stella <ceste...@gmail.com>
Subject Re: [DISCUSS] How to do Sliding Windows in Profiler
Date Wed, 25 Jan 2017 15:02:35 GMT
So, I think the key takeaway, at least for me, and my key feedback is that:

   - We should separate the wrong behavior of MAD with the ability to
   improve windowing.  We can fix one without the other and, in my opinion,
   the key fix for MAD is keeping the value distribution independent of the
   MAD state, as Matt noted at the end of this email.
   - Because the above necessitates tracking two things, it'd be nice to
   adopt Matt's suggestion to be able to sort of merge profiles track two
   results.

One more thing, in general, if you ever want to merge outputs on read, then
you should never write out the merged data (unless the data you're writing
has a way to untangle the overlap).  They are two separate semantics.

On merge on read, you choose the window at read time.  An example of this
for a hypothetically adjusted MAD to have a OUTLIER_MAD_INIT function which
takes the value distribution would be:
{
  "profiles": [
    {
      "profile": "[ 'sketchy_values', 'sketchy_mad' ]",
      "foreach": "'global'",
      "init" : {
        "s": "OUTLIER_MAD_INIT(STATS_MERGE(PROFILE_GET('sketchy_values',
'global', 5, 'MINUTES')))",
        "val_dist": "STATS_INIT()"
        },
      "update": {
        "val_dist": "STATS_ADD(val_dist, value)",
        "s": "OUTLIER_MAD_ADD(s, value)"
        },
      "result": {
        "sketchy_values": "val_dist",
        "sketchy_mad": "s",
        }
    }
  ]
}

You would then read this in the enrichment topology or wherever
via OUTLIER_MAD_SCORE(OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad',
'global', 5, 'MINUTES')), value)


On merge on write, you choose the window size on write time and should read
only the latest profile entry on read, so that would look like:
{
  "profiles": [
    {
      "profile": "[ 'sketchy_values', 'sketchy_mad', 'windowed_mad' ]",
      "foreach": "'global'",
      "init" : {
        "s": "OUTLIER_MAD_INIT(STATS_MERGE(PROFILE_GET('sketchy_values',
'global', 5, 'MINUTES')))",
        "val_dist": "STATS_INIT()"
        },
      "update": {
        "val_dist": "STATS_ADD(val_dist, value)",
        "s": "OUTLIER_MAD_ADD(s, value)"
        },
      "result": {
        "sketchy_values": "val_dist",
        "sketchy_mad": "s",
        "windowed_mad" : "OUTLIER_MAD_STATE_MERGE(s,
PROFILE_GET('sketchy_mad', 'global', 4, 'MINUTES'))"
        }
    }
  ]
}


You would then score pulling only the most recent MAD state.  Merging
across them would double count entries, as Matt mentioned above
OUTLIER_MAD_SCORE(GET_LAST(PROFILE_GET('sketchy_mad', 'global', 1,
'MINUTES')), value)

So, tl;dr, I'm in favor of Matt's syntax to merge profiles and give
multiple outputs.  Also, different but coupled problem, MAD needs fixing to
decouple tracking value distributions from the distribution of the
deviation from the median.

Casey

On Wed, Jan 25, 2017 at 9:28 AM, Nick Allen <nick@nickallen.org> wrote:

> That was a lot to digest so I apologize if I have missed some of your
> thought process.  I probably need to read through this a couple more times.
>
>
> Can anyone specify a better way to do Sliding Window profiles correctly
> > with current functionality?
>
>
> Could we not change the underlying implementation of the profiler to
> support sliding windows?  Ideally the same profile definition could be
> applied to either a tumbling or sliding window, rather than requiring a
> change in the profile definition itself.
>
> The work that I had done for METRON-590, made it possible to configure
> either a tumbling or sliding window.  The downside of that work is that it
> was an all-or-nothing change, all profiles (in the same topology) were
> either tumbling or sliding.  But perhaps we can enhance it from there.
>
>
> A much more elegant solution would be to allow “result” to write two
> > objects.
>
>
> Yes, I like this! I have had the same thought myself.  This would be a
> simple change and would provide the user with some flexibility.  Does this
> change solve the entire problem though?
>
>
> Obviously, I do need to sit down and think on this a bit more.  I really
> wish we could handle these use cases with a profile definition that remains
> simple and easy to grok.  Thanks for spending the time to think through
> this.
>
> On Wed, Jan 25, 2017 at 2:00 AM, Matt Foley <mattf@apache.org> wrote:
>
> > Hi all,
> >
> > Casey and I had an interesting chat yesterday, during which we agreed
> that
> > the example code for Outlier Analysis in https://github.com/apache/
> > incubator-metron/blob/master/metron-analytics/metron-
> statistics/README.md
> > and the revised example code in https://issues.apache.org/
> > jira/browse/METRON-668 (as of 23 January) both do not correctly implement
> > the desired Sliding Window model.  This email gives the argument for why,
> > and proposes a couple ways to do it right.  Your input and preferences
> are
> > requested.
> >
> >
> >
> > First a couple statements about the STATS object that underlies most
> > interesting Profile work:
> >
> > ·         The STATS object is a t-digest.  It summarizes a set of data
> > points, such as those received during a sampling period, in a way that is
> > nominally O(1) regardless of the input number of data points, and
> preserves
> > the info about the “shape” of the statistical distribution of those
> > points.  Not only info about averages and standard deviations, but also
> > about medians and percentiles (which, btw, is a very different kind of
> > information), is preserved and can be calculated correctly to within a
> > given error epsilon.  Since it is a summary, however, time information is
> > lost.
> >
> > ·         STATS objects, these digests of sampling periods, are
> MERGEABLE,
> > meaning if you have a digest from time(1) to time(2), and another digest
> > from time(2) to time (3), you can merge them and get a digest that is
> > statistically equivalent to a digest from time(1) to time(3)
> continuously.
> >
> > ·         They are sort of idempotent, in that if you take two identical
> > digests and merge them, you get almost the same object.  However, the
> > result object will be scaled as summarizing twice the number of input
> data
> > points.
> >
> > ·         Which is why it DOESN’T work to try to merge overlapping
> > sampling periods.  To give a crude example, if you have a digest from
> > time(1) to time(3) and another digest from time(2) to time(4), and merge
> > them, the samples from time(2) to time(3) will be over-represented by a
> > factor of 2x, which should be expected to skew the distribution (unless
> the
> > distribution really is constant for all sub-windows – which would mean we
> > don’t need Sliding Windows because nothing changes).
> >
> >
> >
> > The Outlier Analysis profiles linked above try to implement a sliding
> > window, in which each profile period summarizes the Median Absolute
> > Deviation distribution of the last five profile periods only.  An
> “Outlier
> > MAD Score” can then be determined by comparing the deviation of a new
> data
> > point to the last MAD distribution recorded in the Profile.  This allows
> > for changes over time in the statistical distribution of inputs, but does
> > not make the measure unduly sensitive to just the last minute or two.
> This
> > is a typical use case for Sliding Windows.
> >
> >
> >
> > Both example codes trip on how to do the sliding window in the context of
> > a Profile.  At sampling period boundaries, both either compose the
> “result”
> > digest or initialize the “next” digest by reading the previous 5 result
> > digests.  That is wrong, because it ignores the fact that those digests
> > aren’t just for their time periods.  They too were composed with THEIR
> > preceding 5 digests, each of which were composed with their preceding 5
> > digests, which in turn… etc.  The end result is sort of like the way
> > Madieras or some brandies are aged via the Solera process with fractional
> > blending.  You don’t get a true sliding window, which sharply drops the
> > past, you get a continuous dilution of the past. In fact, it’s wrong to
> > assume that the “tail” of the far past is more diluted than the near
> past!
> > – It would be with some algorithms, but the alg used in these two
> examples
> > causes the far past to become an exponentially MORE important fraction of
> > the overall data than the near past – much worse than simply turning on
> > digesting at time(0) and leaving it on, with no attempt at windowing.
> > (Simulate it in a spreadsheet and you’ll see.)
> >
> >
> >
> > We need a Profiler structure that assists in creating Sliding Window
> > profiles.  The problem is that Profiles let you save only one thing
> > (quantity or object) per sampling period, and that’s typically a
> different
> > “thing” (object type or scale) than you want to use to compose the result
> > for each windowed span.  One way to do it correctly would be with two
> > Profiles, like this:
> >
> >
> >
> > (SOLUTION A)
> >
> > {
> >
> >   "profiles": [
> >
> >     {
> >
> >       "profile": "sketchy_mad",
> >
> >       "foreach": "'global'",
> >
> >       "init" : {
> >
> >         "s": "OUTLIER_MAD_INIT()"
> >
> >         },
> >
> >       "update": {
> >
> >         "s": "OUTLIER_MAD_ADD(s, value)"
> >
> >         },
> >
> >       "result": "s"
> >
> >     },
> >
> >     {
> >
> >       "profile": "windowed_mad",
> >
> >       "foreach": "'global'",
> >
> >       "init" : { },
> >
> >       "update": { },
> >
> >       "result": "OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad',
> >
> > 'global', 5, 'MINUTES'))"
> >
> >     }
> >
> >   ]
> >
> > }
> >
> >
> >
> > This is typical.  You have a fine-grain sampling period that you want to
> > “tumble”, and a broader window that you want to “slide” or “roll” along
> the
> > underlying smaller sampling periods.
> >
> > The above pair of profiles is clear enough, but not real efficient.
> Also,
> > you can’t guarantee they’ll stay exactly synchronized.  It would be
> better
> > to combine them somehow.
> >
> >
> >
> > In these examples, we only need to look back 5 sampling periods, so the
> > following would work (NOTE: requires Casey’s PR#668 that preserves state
> > between sampling periods):
> >
> >
> >
> > (SOLUTION B)
> >
> > {
> >
> >   "profiles": [
> >
> >     {
> >
> >       "profile": "windowed_mad",
> >
> >       "foreach": "'global'",
> >
> >       "init" : {
> >
> >         "s4": "if exists(s3) then s3 else OUTLIER_MAD_INIT()",
> >
> >         "s3": "if exists(s2) then s2 else OUTLIER_MAD_INIT()",
> >
> >         "s2": "if exists(s1) then s1 else OUTLIER_MAD_INIT()",
> >
> >        "s1": "if exists(s0) then s0 else OUTLIER_MAD_INIT()",
> >
> >         "s0": "OUTLIER_MAD_INIT()"
> >
> >         },
> >
> >       "update": {
> >
> >         "s0": "OUTLIER_MAD_ADD(s0, value)"
> >
> >         },
> >
> >       "result": "OUTLIER_MAD_STATE_MERGE(s0, s1, s2, s3, s4)”
> >
> >     }
> >
> >   ]
> >
> > }
> >
> >
> >
> > This works fine for a sliding window of size 5 sample periods.  But what
> > if it was 15?  Or 60?
> >
> > No doubt we could implement a ring buffer with a stellar list object, but
> > it’s gonna get ugly(er) and slow.
> >
> >
> >
> > A much more elegant solution would be to allow “result” to write two
> > objects.  The result could be viewed either as two Profiles, or as a
> > Profile with multiple addressable columns of hbase output.  For instance:
> >
> > (SOLUTION C)
> >
> > {
> >
> >   "profiles": [
> >
> >     {
> >
> >       "profile": "[ 'sketchy_mad', 'windowed_mad' ]",
> >
> >       "foreach": "'global'",
> >
> >       "init" : {
> >
> >         "s": "OUTLIER_MAD_INIT()"
> >
> >         },
> >
> >       "update": {
> >
> >         "s": "OUTLIER_MAD_ADD(s, value)"
> >
> >         },
> >
> >       "result": {
> >
> >         "sketchy_mad": "s",
> >
> >         "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s,
> > PROFILE_GET('sketchy_mad',
> >
> > 'global', 4, 'MINUTES'))"
> >
> >         }
> >
> >     }
> >
> >   ]
> >
> > }
> >
> >
> >
> > This is clean and tight, and I think will be easily generalized to any
> > Sliding Window scenario.
> >
> > BTW, given that each write has to work it’s way through more processing
> > before getting to HBase, it’s best to say that all writes happen AFTER
> all
> > the “result” processing, which is why I showed the MERGE above retrieving
> > only 4 sampling periods prior to the current one, and then add the
> current
> > one from local state.
> >
> >
> >
> > What do you think?  Can anyone specify a better way to do Sliding Window
> > profiles correctly with current functionality?
> >
> >
> >
> >
> >
> > Casey noted that if this feature were available, he would like to
> consider
> > separating the value distribution from the deviation distribution, in the
> > OUTLIER_MAD state objects.  He also notes that the hypothetical
> > OUTLIER_MAD_INIT() function needs to take a starting median, as argument.
> > So the final Sliding Window Profile for OUTLIER_MAD state tracking might
> > look more like:
> >
> >
> >
> > {
> >
> >   "profiles": [
> >
> >     {
> >
> >       "profile": "[ 'sketchy_values', 'sketchy_mad', 'windowed_mad' ]",
> >
> >       "foreach": "'global'",
> >
> >       "init" : {
> >
> >         "median": "STATS_PERCENTILE(STATS_MERGE(
> PROFILE_GET('sketchy_values',
> > 'global', 5, 'MINUTES')), 50)",
> >
> >         "s": "OUTLIER_MAD_INIT(median)",
> >
> >         "val_dist": "STATS_INIT()"
> >
> >         },
> >
> >       "update": {
> >
> >         "val_dist": "STATS_ADD(val_dist, value)",
> >
> >         "s": "OUTLIER_MAD_ADD(s, value)"
> >
> >         },
> >
> >       "result": {
> >
> >         "sketchy_values": "val_dist",
> >
> >         "sketchy_mad": "s",
> >
> >         "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s,
> > PROFILE_GET('sketchy_mad',
> >
> > 'global', 4, 'MINUTES'))"
> >
> >         }
> >
> >     },
> >
> >   ]
> >
> > }
> >
> >
> >
> > Thanks,
> >
> > --Matt
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>
>
> --
> Nick Allen <nick@nickallen.org>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message