metron-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Casey Stella <ceste...@gmail.com>
Subject Re: [DISCUSS] How to do Sliding Windows in Profiler
Date Fri, 27 Jan 2017 13:56:45 GMT
So, I did some digging into the implementation of Median Absolute Deviation
that we ship.  I think we're doing the right thing here for sliding
windows.  The key assumption that we were making as part of this discussion
is that were storing 2 distributions in the state:

   - The distribution of the values
   - The distribution of the deviation from the median

In fact, we are actually storing 4 distributions:

   - The distribution of the values for this tick
      - This is used ONLY when merging states
   - The distribution of the values for the whole sliding window
      - This is used ONLY for scoring
   - The distribution of the deviation from the median of the sliding
   window for this tick
      - This is used ONLY when merging states
      - The median used here is the median of the merged tick-granular
      distribution of values across the whole window
   - The distribution of the deviation from the median of the sliding
   window for the whole sliding window
      - This is used ONLY for scoring
      - The median used here is the median of the merged tick-granular
      distribution of values across the whole window


The tick-granular distributions are used for merging, so we do not get
overlapping distributions of values or deviations from medians.  The
window-granular distributions are used for the actual scoring, so you get
the full context of the window.

I would urge you guys to check my work and let me know if I interpreted it
wrong by looking at the State class here
<https://github.com/apache/incubator-metron/blob/master/metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/outlier/MedianAbsoluteDeviationFunctions.java#L33>
.

All that being said, I still think that providing the ability to structure
multiple profiles at once as Matt suggested would be extremely nice and I'm
in favor of it, on the whole.

Casey

On Wed, Jan 25, 2017 at 2:06 PM, Michael Miklavcic <
michael.miklavcic@gmail.com> wrote:

> Hey guys,
>
> 1. I'm going to Google that brandy analogy later. It sounds tasty :)
> 2. The option for reading/writing multiple profiles makes sense to me and
> provides much greater flexibility and efficiency.
> 3. My gut reaction to reading through this is "wow." This seems really
> complicated. I'm all for the flexibility provided by these composable
> functions and think this would be much better for users with either some
> templates or other form of abstraction to simplify common patterns.
>
> Mike
>
>
> On Wed, Jan 25, 2017 at 8:02 AM, Casey Stella <cestella@gmail.com> wrote:
>
> > So, I think the key takeaway, at least for me, and my key feedback is
> that:
> >
> >    - We should separate the wrong behavior of MAD with the ability to
> >    improve windowing.  We can fix one without the other and, in my
> opinion,
> >    the key fix for MAD is keeping the value distribution independent of
> the
> >    MAD state, as Matt noted at the end of this email.
> >    - Because the above necessitates tracking two things, it'd be nice to
> >    adopt Matt's suggestion to be able to sort of merge profiles track two
> >    results.
> >
> > One more thing, in general, if you ever want to merge outputs on read,
> then
> > you should never write out the merged data (unless the data you're
> writing
> > has a way to untangle the overlap).  They are two separate semantics.
> >
> > On merge on read, you choose the window at read time.  An example of this
> > for a hypothetically adjusted MAD to have a OUTLIER_MAD_INIT function
> which
> > takes the value distribution would be:
> > {
> >   "profiles": [
> >     {
> >       "profile": "[ 'sketchy_values', 'sketchy_mad' ]",
> >       "foreach": "'global'",
> >       "init" : {
> >         "s": "OUTLIER_MAD_INIT(STATS_MERGE(PROFILE_GET('sketchy_values',
> > 'global', 5, 'MINUTES')))",
> >         "val_dist": "STATS_INIT()"
> >         },
> >       "update": {
> >         "val_dist": "STATS_ADD(val_dist, value)",
> >         "s": "OUTLIER_MAD_ADD(s, value)"
> >         },
> >       "result": {
> >         "sketchy_values": "val_dist",
> >         "sketchy_mad": "s",
> >         }
> >     }
> >   ]
> > }
> >
> > You would then read this in the enrichment topology or wherever
> > via OUTLIER_MAD_SCORE(OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad',
> > 'global', 5, 'MINUTES')), value)
> >
> >
> > On merge on write, you choose the window size on write time and should
> read
> > only the latest profile entry on read, so that would look like:
> > {
> >   "profiles": [
> >     {
> >       "profile": "[ 'sketchy_values', 'sketchy_mad', 'windowed_mad' ]",
> >       "foreach": "'global'",
> >       "init" : {
> >         "s": "OUTLIER_MAD_INIT(STATS_MERGE(PROFILE_GET('sketchy_values',
> > 'global', 5, 'MINUTES')))",
> >         "val_dist": "STATS_INIT()"
> >         },
> >       "update": {
> >         "val_dist": "STATS_ADD(val_dist, value)",
> >         "s": "OUTLIER_MAD_ADD(s, value)"
> >         },
> >       "result": {
> >         "sketchy_values": "val_dist",
> >         "sketchy_mad": "s",
> >         "windowed_mad" : "OUTLIER_MAD_STATE_MERGE(s,
> > PROFILE_GET('sketchy_mad', 'global', 4, 'MINUTES'))"
> >         }
> >     }
> >   ]
> > }
> >
> >
> > You would then score pulling only the most recent MAD state.  Merging
> > across them would double count entries, as Matt mentioned above
> > OUTLIER_MAD_SCORE(GET_LAST(PROFILE_GET('sketchy_mad', 'global', 1,
> > 'MINUTES')), value)
> >
> > So, tl;dr, I'm in favor of Matt's syntax to merge profiles and give
> > multiple outputs.  Also, different but coupled problem, MAD needs fixing
> to
> > decouple tracking value distributions from the distribution of the
> > deviation from the median.
> >
> > Casey
> >
> > On Wed, Jan 25, 2017 at 9:28 AM, Nick Allen <nick@nickallen.org> wrote:
> >
> > > That was a lot to digest so I apologize if I have missed some of your
> > > thought process.  I probably need to read through this a couple more
> > times.
> > >
> > >
> > > Can anyone specify a better way to do Sliding Window profiles correctly
> > > > with current functionality?
> > >
> > >
> > > Could we not change the underlying implementation of the profiler to
> > > support sliding windows?  Ideally the same profile definition could be
> > > applied to either a tumbling or sliding window, rather than requiring a
> > > change in the profile definition itself.
> > >
> > > The work that I had done for METRON-590, made it possible to configure
> > > either a tumbling or sliding window.  The downside of that work is that
> > it
> > > was an all-or-nothing change, all profiles (in the same topology) were
> > > either tumbling or sliding.  But perhaps we can enhance it from there.
> > >
> > >
> > > A much more elegant solution would be to allow “result” to write two
> > > > objects.
> > >
> > >
> > > Yes, I like this! I have had the same thought myself.  This would be a
> > > simple change and would provide the user with some flexibility.  Does
> > this
> > > change solve the entire problem though?
> > >
> > >
> > > Obviously, I do need to sit down and think on this a bit more.  I
> really
> > > wish we could handle these use cases with a profile definition that
> > remains
> > > simple and easy to grok.  Thanks for spending the time to think through
> > > this.
> > >
> > > On Wed, Jan 25, 2017 at 2:00 AM, Matt Foley <mattf@apache.org> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Casey and I had an interesting chat yesterday, during which we agreed
> > > that
> > > > the example code for Outlier Analysis in https://github.com/apache/
> > > > incubator-metron/blob/master/metron-analytics/metron-
> > > statistics/README.md
> > > > and the revised example code in https://issues.apache.org/
> > > > jira/browse/METRON-668 (as of 23 January) both do not correctly
> > implement
> > > > the desired Sliding Window model.  This email gives the argument for
> > why,
> > > > and proposes a couple ways to do it right.  Your input and
> preferences
> > > are
> > > > requested.
> > > >
> > > >
> > > >
> > > > First a couple statements about the STATS object that underlies most
> > > > interesting Profile work:
> > > >
> > > > ·         The STATS object is a t-digest.  It summarizes a set of
> data
> > > > points, such as those received during a sampling period, in a way
> that
> > is
> > > > nominally O(1) regardless of the input number of data points, and
> > > preserves
> > > > the info about the “shape” of the statistical distribution of those
> > > > points.  Not only info about averages and standard deviations, but
> also
> > > > about medians and percentiles (which, btw, is a very different kind
> of
> > > > information), is preserved and can be calculated correctly to within
> a
> > > > given error epsilon.  Since it is a summary, however, time
> information
> > is
> > > > lost.
> > > >
> > > > ·         STATS objects, these digests of sampling periods, are
> > > MERGEABLE,
> > > > meaning if you have a digest from time(1) to time(2), and another
> > digest
> > > > from time(2) to time (3), you can merge them and get a digest that is
> > > > statistically equivalent to a digest from time(1) to time(3)
> > > continuously.
> > > >
> > > > ·         They are sort of idempotent, in that if you take two
> > identical
> > > > digests and merge them, you get almost the same object.  However, the
> > > > result object will be scaled as summarizing twice the number of input
> > > data
> > > > points.
> > > >
> > > > ·         Which is why it DOESN’T work to try to merge overlapping
> > > > sampling periods.  To give a crude example, if you have a digest from
> > > > time(1) to time(3) and another digest from time(2) to time(4), and
> > merge
> > > > them, the samples from time(2) to time(3) will be over-represented
> by a
> > > > factor of 2x, which should be expected to skew the distribution
> (unless
> > > the
> > > > distribution really is constant for all sub-windows – which would
> mean
> > we
> > > > don’t need Sliding Windows because nothing changes).
> > > >
> > > >
> > > >
> > > > The Outlier Analysis profiles linked above try to implement a sliding
> > > > window, in which each profile period summarizes the Median Absolute
> > > > Deviation distribution of the last five profile periods only.  An
> > > “Outlier
> > > > MAD Score” can then be determined by comparing the deviation of a new
> > > data
> > > > point to the last MAD distribution recorded in the Profile.  This
> > allows
> > > > for changes over time in the statistical distribution of inputs, but
> > does
> > > > not make the measure unduly sensitive to just the last minute or two.
> > > This
> > > > is a typical use case for Sliding Windows.
> > > >
> > > >
> > > >
> > > > Both example codes trip on how to do the sliding window in the
> context
> > of
> > > > a Profile.  At sampling period boundaries, both either compose the
> > > “result”
> > > > digest or initialize the “next” digest by reading the previous 5
> result
> > > > digests.  That is wrong, because it ignores the fact that those
> digests
> > > > aren’t just for their time periods.  They too were composed with
> THEIR
> > > > preceding 5 digests, each of which were composed with their
> preceding 5
> > > > digests, which in turn… etc.  The end result is sort of like the way
> > > > Madieras or some brandies are aged via the Solera process with
> > fractional
> > > > blending.  You don’t get a true sliding window, which sharply drops
> the
> > > > past, you get a continuous dilution of the past. In fact, it’s wrong
> to
> > > > assume that the “tail” of the far past is more diluted than the near
> > > past!
> > > > – It would be with some algorithms, but the alg used in these two
> > > examples
> > > > causes the far past to become an exponentially MORE important
> fraction
> > of
> > > > the overall data than the near past – much worse than simply turning
> on
> > > > digesting at time(0) and leaving it on, with no attempt at windowing.
> > > > (Simulate it in a spreadsheet and you’ll see.)
> > > >
> > > >
> > > >
> > > > We need a Profiler structure that assists in creating Sliding Window
> > > > profiles.  The problem is that Profiles let you save only one thing
> > > > (quantity or object) per sampling period, and that’s typically a
> > > different
> > > > “thing” (object type or scale) than you want to use to compose the
> > result
> > > > for each windowed span.  One way to do it correctly would be with two
> > > > Profiles, like this:
> > > >
> > > >
> > > >
> > > > (SOLUTION A)
> > > >
> > > > {
> > > >
> > > >   "profiles": [
> > > >
> > > >     {
> > > >
> > > >       "profile": "sketchy_mad",
> > > >
> > > >       "foreach": "'global'",
> > > >
> > > >       "init" : {
> > > >
> > > >         "s": "OUTLIER_MAD_INIT()"
> > > >
> > > >         },
> > > >
> > > >       "update": {
> > > >
> > > >         "s": "OUTLIER_MAD_ADD(s, value)"
> > > >
> > > >         },
> > > >
> > > >       "result": "s"
> > > >
> > > >     },
> > > >
> > > >     {
> > > >
> > > >       "profile": "windowed_mad",
> > > >
> > > >       "foreach": "'global'",
> > > >
> > > >       "init" : { },
> > > >
> > > >       "update": { },
> > > >
> > > >       "result": "OUTLIER_MAD_STATE_MERGE(PROFILE_GET('sketchy_mad',
> > > >
> > > > 'global', 5, 'MINUTES'))"
> > > >
> > > >     }
> > > >
> > > >   ]
> > > >
> > > > }
> > > >
> > > >
> > > >
> > > > This is typical.  You have a fine-grain sampling period that you want
> > to
> > > > “tumble”, and a broader window that you want to “slide” or “roll”
> along
> > > the
> > > > underlying smaller sampling periods.
> > > >
> > > > The above pair of profiles is clear enough, but not real efficient.
> > > Also,
> > > > you can’t guarantee they’ll stay exactly synchronized.  It would be
> > > better
> > > > to combine them somehow.
> > > >
> > > >
> > > >
> > > > In these examples, we only need to look back 5 sampling periods, so
> the
> > > > following would work (NOTE: requires Casey’s PR#668 that preserves
> > state
> > > > between sampling periods):
> > > >
> > > >
> > > >
> > > > (SOLUTION B)
> > > >
> > > > {
> > > >
> > > >   "profiles": [
> > > >
> > > >     {
> > > >
> > > >       "profile": "windowed_mad",
> > > >
> > > >       "foreach": "'global'",
> > > >
> > > >       "init" : {
> > > >
> > > >         "s4": "if exists(s3) then s3 else OUTLIER_MAD_INIT()",
> > > >
> > > >         "s3": "if exists(s2) then s2 else OUTLIER_MAD_INIT()",
> > > >
> > > >         "s2": "if exists(s1) then s1 else OUTLIER_MAD_INIT()",
> > > >
> > > >        "s1": "if exists(s0) then s0 else OUTLIER_MAD_INIT()",
> > > >
> > > >         "s0": "OUTLIER_MAD_INIT()"
> > > >
> > > >         },
> > > >
> > > >       "update": {
> > > >
> > > >         "s0": "OUTLIER_MAD_ADD(s0, value)"
> > > >
> > > >         },
> > > >
> > > >       "result": "OUTLIER_MAD_STATE_MERGE(s0, s1, s2, s3, s4)”
> > > >
> > > >     }
> > > >
> > > >   ]
> > > >
> > > > }
> > > >
> > > >
> > > >
> > > > This works fine for a sliding window of size 5 sample periods.  But
> > what
> > > > if it was 15?  Or 60?
> > > >
> > > > No doubt we could implement a ring buffer with a stellar list object,
> > but
> > > > it’s gonna get ugly(er) and slow.
> > > >
> > > >
> > > >
> > > > A much more elegant solution would be to allow “result” to write two
> > > > objects.  The result could be viewed either as two Profiles, or as a
> > > > Profile with multiple addressable columns of hbase output.  For
> > instance:
> > > >
> > > > (SOLUTION C)
> > > >
> > > > {
> > > >
> > > >   "profiles": [
> > > >
> > > >     {
> > > >
> > > >       "profile": "[ 'sketchy_mad', 'windowed_mad' ]",
> > > >
> > > >       "foreach": "'global'",
> > > >
> > > >       "init" : {
> > > >
> > > >         "s": "OUTLIER_MAD_INIT()"
> > > >
> > > >         },
> > > >
> > > >       "update": {
> > > >
> > > >         "s": "OUTLIER_MAD_ADD(s, value)"
> > > >
> > > >         },
> > > >
> > > >       "result": {
> > > >
> > > >         "sketchy_mad": "s",
> > > >
> > > >         "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s,
> > > > PROFILE_GET('sketchy_mad',
> > > >
> > > > 'global', 4, 'MINUTES'))"
> > > >
> > > >         }
> > > >
> > > >     }
> > > >
> > > >   ]
> > > >
> > > > }
> > > >
> > > >
> > > >
> > > > This is clean and tight, and I think will be easily generalized to
> any
> > > > Sliding Window scenario.
> > > >
> > > > BTW, given that each write has to work it’s way through more
> processing
> > > > before getting to HBase, it’s best to say that all writes happen
> AFTER
> > > all
> > > > the “result” processing, which is why I showed the MERGE above
> > retrieving
> > > > only 4 sampling periods prior to the current one, and then add the
> > > current
> > > > one from local state.
> > > >
> > > >
> > > >
> > > > What do you think?  Can anyone specify a better way to do Sliding
> > Window
> > > > profiles correctly with current functionality?
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Casey noted that if this feature were available, he would like to
> > > consider
> > > > separating the value distribution from the deviation distribution, in
> > the
> > > > OUTLIER_MAD state objects.  He also notes that the hypothetical
> > > > OUTLIER_MAD_INIT() function needs to take a starting median, as
> > argument.
> > > > So the final Sliding Window Profile for OUTLIER_MAD state tracking
> > might
> > > > look more like:
> > > >
> > > >
> > > >
> > > > {
> > > >
> > > >   "profiles": [
> > > >
> > > >     {
> > > >
> > > >       "profile": "[ 'sketchy_values', 'sketchy_mad', 'windowed_mad'
> ]",
> > > >
> > > >       "foreach": "'global'",
> > > >
> > > >       "init" : {
> > > >
> > > >         "median": "STATS_PERCENTILE(STATS_MERGE(
> > > PROFILE_GET('sketchy_values',
> > > > 'global', 5, 'MINUTES')), 50)",
> > > >
> > > >         "s": "OUTLIER_MAD_INIT(median)",
> > > >
> > > >         "val_dist": "STATS_INIT()"
> > > >
> > > >         },
> > > >
> > > >       "update": {
> > > >
> > > >         "val_dist": "STATS_ADD(val_dist, value)",
> > > >
> > > >         "s": "OUTLIER_MAD_ADD(s, value)"
> > > >
> > > >         },
> > > >
> > > >       "result": {
> > > >
> > > >         "sketchy_values": "val_dist",
> > > >
> > > >         "sketchy_mad": "s",
> > > >
> > > >         "windowed_mad": "OUTLIER_MAD_STATE_MERGE(s,
> > > > PROFILE_GET('sketchy_mad',
> > > >
> > > > 'global', 4, 'MINUTES'))"
> > > >
> > > >         }
> > > >
> > > >     },
> > > >
> > > >   ]
> > > >
> > > > }
> > > >
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > --Matt
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Nick Allen <nick@nickallen.org>
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message