metron-dev mailing list archives

From Ryan Merriman <merrim...@gmail.com>
Subject Re: [DISCUSS] Error Indexing
Date Tue, 24 Jan 2017 21:10:43 GMT
That use case makes sense to me.  I don't think it will require that much
additional effort either.
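
A minimal sketch of the use case discussed in this thread, under stated assumptions: hash the field value when the failing field is known (an enrichment failure), and hash the complete message when it is not (a parser failure). The function name `error_hash`, the sample field names, and the choice of SHA3-256 are illustrative assumptions, not existing Metron code.

```python
import hashlib
import json


def error_hash(raw_message, failed_field=None):
    """Return a SHA3-256 hex digest identifying the data behind an error.

    If the failing field is known (e.g. an enrichment failure), only that
    field's value is hashed. Otherwise (e.g. a parser failure) the whole
    message is hashed.
    """
    if failed_field is not None and failed_field in raw_message:
        payload = str(raw_message[failed_field])
    else:
        # Serialize with sorted keys so identical messages hash identically.
        payload = json.dumps(raw_message, sort_keys=True)
    return hashlib.sha3_256(payload.encode("utf-8")).hexdigest()


# Hypothetical message; the field names are assumptions for illustration.
msg = {"ip_src_addr": "8.8.8.8", "protocol": "dns"}
field_hash = error_hash(msg, failed_field="ip_src_addr")
message_hash = error_hash(msg)
```

Because the digest depends only on the original data, replays and repeat failures on the same value end up with the same key, which a dashboard could group on.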

On Tue, Jan 24, 2017 at 1:02 PM, Zeolla@GMail.com <zeolla@gmail.com> wrote:

> Regarding error vs validation - Either way I'm not very concerned.  I
> initially assumed they would be combined and agree with that approach, but
> splitting them out isn't a very big deal to me either.
>
> Re: Ryan.  Yes, exactly.  In the case of a parser issue (or anywhere else
> where it's not possible to pick out the exact thing causing the issue) it
> would be a hash of the complete message.
>
> Regarding the architecture, I mostly agree with James except that I think
> step 3 needs to also be able to somehow group errors via the original
> data (identify replays, identify repeat issues with data in a specific
> field, issues with consistently different data, etc.).  This is
> essentially the first step of troubleshooting, which I assume you are
> doing if you're looking at the error dashboard.
>
> If the hash gets moved out of the initial implementation, I'm fairly
> certain you lose this ability.  The point here isn't to handle long fields
> (although that's a benefit of this approach), it's to attach a unique
> identifier to the error/validation issue message that links it to the
> original problem.  I'd be happy to consider alternative solutions to this
> problem (for instance, actually sending across the data itself); I just
> haven't been able to think of another way to do this that I like better.
>
> Jon
>
> On Tue, Jan 24, 2017 at 1:13 PM Ryan Merriman <merrimanr@gmail.com> wrote:
>
> > We also need a JIRA for any install/Ansible/MPack work needed.
> >
> > On Tue, Jan 24, 2017 at 12:06 PM, James Sirota <jsirota@apache.org> wrote:
> >
> > > Now that I had some time to think about it I would collapse all error and
> > > validation topics into one.  We can differentiate between different views
> > > of the data (split by error source etc) via Kibana dashboards.  I would
> > > implement this feature incrementally.  First I would modify all the bolts
> > > to log to a single topic.  Second, I would get the error indexing done by
> > > attaching the indexing topology to the error topic. Third I would create
> > > the necessary dashboards to view errors and validation failures by source.
> > > Lastly, I would file a follow-on JIRA to introduce hashing of errors or
> > > fields that are too long.  It seems like a separate feature that we need to
> > > think through.  We may need a stellar function around that.
> > >
> > > Thanks,
> > > James
> > >
> > > 24.01.2017, 10:25, "Ryan Merriman" <merrimanr@gmail.com>:
> > > > I understand what Jon is talking about. He's proposing we hash the value
> > > > that caused the error, not necessarily the error message itself. For an
> > > > enrichment this is easy. Just pass along the field value that failed
> > > > enrichment. For other cases the field that caused the error may not be so
> > > > obvious. Take parser validation for example. The message is validated as
> > > > a whole and it may not be easy to determine which field is the cause. In
> > > > that case would a hash of the whole message work?
> > > >
> > > > There is a broader architectural discussion that needs to happen before we
> > > > can implement this. Currently we have an indexing topology that reads from
> > > > 1 topic and writes messages to ES but errors are written to several
> > > > different topics:
> > > >
> > > >    - parser_error
> > > >    - parser_invalid
> > > >    - enrichments_error
> > > >    - threatintel_error
> > > >    - indexing_error
> > > >
> > > > I can see 4 possible approaches to implementing this:
> > > >
> > > >    1. Create an index topology for each error topic
> > > >       1. Good because we can easily reuse the indexing topology and would
> > > >       require the least development effort
> > > >       2. Bad because it would consume a lot of extra worker slots
> > > >    2. Move the topic name into the error JSON message as a new "error_type"
> > > >    field and write all messages to the indexing topic
> > > >       1. Good because we don't need to create a new topology
> > > >       2. Bad because we would be flowing data and errors through the same
> > > >       topology. A spike in errors could affect message indexing.
> > > >    3. Compromise between 1 and 2. Create another indexing topology that is
> > > >    dedicated to indexing errors. Move the topic name into the error JSON
> > > >    message as a new "error_type" field and write all errors to a single
> > > >    error topic.
> > > >    4. Write a completely new topology with multiple spouts (1 for each
> > > >    error type listed above) that all feed into a single
> > > >    BulkMessageWriterBolt.
> > > >       1. Good because the current topologies would not need to change
> > > >       2. Bad because it would require the most development effort, would
> > > >       not reuse existing topologies and takes up more worker slots than 3
> > > > Are there other approaches I haven't thought of? I think 1 and 2 are off
> > > > the table because they are shortcuts and not good long-term solutions. 3
> > > > would be my choice because it introduces less complexity than 4. Thoughts?
> > > >
> > > > Ryan
> > > >
> > > > On Mon, Jan 23, 2017 at 5:44 PM, Zeolla@GMail.com <zeolla@gmail.com> wrote:
> > > >
> > > >>  In that case the hash would be of the value in the IP field, such as
> > > >>  sha3(8.8.8.8).
> > > >>
> > > >>  Jon
> > > >>
> > > >>  On Mon, Jan 23, 2017, 6:41 PM James Sirota <jsirota@apache.org> wrote:
> > > >>
> > > >>  > Jon,
> > > >>  >
> > > >>  > I am still not entirely following why we would want to use hashing. For
> > > >>  > example if my error is "Your IP field is invalid and failed validation"
> > > >>  > hashing this error string will always result in the same hash. Why not
> > > >>  > just use the actual error string? Can you provide an example where you
> > > >>  > would use it?
> > > >>  >
> > > >>  > Thanks,
> > > >>  > James
> > > >>  >
> > > >>  > 23.01.2017, 16:29, "Zeolla@GMail.com" <zeolla@gmail.com>:
> > > >>  > > For 1 - I'm good with that.
> > > >>  > >
> > > >>  > > I'm talking about hashing the relevant content itself not the error.
> > > >>  > > Some benefits are (1) minimize load on search index (there's minimal
> > > >>  > > benefit in spending the CPU and disk to keep it at full fidelity
> > > >>  > > (tokenize and store)) (2) provide something to key on for dashboards
> > > >>  > > (assuming a good hash algorithm that avoids collisions and is second
> > > >>  > > preimage resistant) and (3) specific to errors, if the issue is that it
> > > >>  > > failed to index, a hash gives us some protection that the issue will
> > > >>  > > not occur twice.
> > > >>  > >
> > > >>  > > Jon
> > > >>  > >
> > > >>  > > On Mon, Jan 23, 2017, 2:47 PM James Sirota <jsirota@apache.org> wrote:
> > > >>  > >
> > > >>  > > Jon,
> > > >>  > >
> > > >>  > > With regards to 1, collapsing to a single dashboard for each would be
> > > >>  > > fine. So we would have one error index and one "failed to validate"
> > > >>  > > index. The distinction is that errors would be things that went wrong
> > > >>  > > during stream processing (failed to parse, etc...), while validation
> > > >>  > > failures are messages that explicitly failed stellar validation/schema
> > > >>  > > enforcement. There should be relatively few of the second type.
> > > >>  > >
> > > >>  > > With respect to 3, why do you want the error hashed? Why not just
> > > >>  > > search for the error text?
> > > >>  > >
> > > >>  > > Thanks,
> > > >>  > > James
> > > >>  > >
> > > >>  > > 20.01.2017, 14:01, "Zeolla@GMail.com" <zeolla@gmail.com>:
> > > >>  > >> As someone who currently fills the platform engineer role, I can give
> > > >>  > >> this idea a huge +1. My thoughts:
> > > >>  > >>
> > > >>  > >> 1. I think it depends on exactly what data is pushed into the index
> > > >>  > >> (#3). However, assuming the errors you proposed recording, I can't see
> > > >>  > >> huge benefits to having more than one dashboard. I would be happy to be
> > > >>  > >> persuaded otherwise.
> > > >>  > >>
> > > >>  > >> 2. I would say yes, storing the errors in HDFS in addition to indexing
> > > >>  > >> is a good thing. Using METRON-510
> > > >>  > >> <https://issues.apache.org/jira/browse/METRON-510> as a case study,
> > > >>  > >> there is the potential in this environment for attacker-controlled data
> > > >>  > >> to result in processing errors which could be a method of evading
> > > >>  > >> security monitoring. Once an attack is identified, the long term HDFS
> > > >>  > >> storage would allow better historical analysis for
> > > >>  > >> low-and-slow/persistent attacks (I'm thinking of a method of data exfil
> > > >>  > >> that also won't successfully get stored in Lucene, but is hard to
> > > >>  > >> identify over a short period of time).
> > > >>  > >> - Along this line, I think that there are various parts of Metron (this
> > > >>  > >> included) which could benefit from having method of configuring data
> > > >>  > >> aging by bucket in HDFS (Following Nick's comments here
> > > >>  > >> <https://issues.apache.org/jira/browse/METRON-477>).
> > > >>  > >>
> > > >>  > >> 3. I would potentially add a hash of the content that failed validation
> > > >>  > >> to help identify repeats over time with less of a concern that you'd
> > > >>  > >> have back to back failures (i.e. instead of storing the value itself).
> > > >>  > >> Additionally, I think it's helpful to be able to search all times there
> > > >>  > >> was an indexing error (instead of it hitting the catch-all).
> > > >>  > >>
> > > >>  > >> Jon
> > > >>  > >>
> > > >>  > >> On Fri, Jan 20, 2017 at 1:17 PM James Sirota <jsirota@apache.org> wrote:
> > > >>  > >>
> > > >>  > >> We already have a capability to capture bolt errors and validation
> > > >>  > >> errors and pipe them into a Kafka topic. I want to propose that we
> > > >>  > >> attach a writer topology to the error and validation failed kafka
> > > >>  > >> topics so that we can (a) create a new ES index for these errors and
> > > >>  > >> (b) create a new Kibana dashboard to visualize them. The benefit would
> > > >>  > >> be that errors and validation failures would be easier to see and
> > > >>  > >> analyze.
> > > >>  > >>
> > > >>  > >> I am seeking feedback on the following:
> > > >>  > >>
> > > >>  > >> - How granular would we want this feature to be? Think we would want
> > > >>  > >> one index/dashboard per source? Or would it be better to collapse
> > > >>  > >> everything into the same index?
> > > >>  > >> - Do we care about storing these errors in HDFS as well? Or is indexing
> > > >>  > >> them enough?
> > > >>  > >> - What types of errors should we record? I am proposing:
> > > >>  > >>
> > > >>  > >> For error reporting:
> > > >>  > >> --Message failed to parse
> > > >>  > >> --Enrichment failed to enrich
> > > >>  > >> --Threat intel feed failures
> > > >>  > >> --Generic catch-all for all other errors
> > > >>  > >>
> > > >>  > >> For validation reporting:
> > > >>  > >> --What part of message failed validation
> > > >>  > >> --What stellar validator caused the failure
> > > >>  > >>
> > > >>  > >> -------------------
> > > >>  > >> Thank you,
> > > >>  > >>
> > > >>  > >> James Sirota
> > > >>  > >> PPMC- Apache Metron (Incubating)
> > > >>  > >> jsirota AT apache DOT org
> > > >>  > >>
> > > >>  > >> --
> > > >>  > >>
> > > >>  > >> Jon
> > > >>  > >>
> > > >>  > >> Sent from my mobile device
> > > >>  > >
> > > >>  > > -------------------
> > > >>  > > Thank you,
> > > >>  > >
> > > >>  > > James Sirota
> > > >>  > > PPMC- Apache Metron (Incubating)
> > > >>  > > jsirota AT apache DOT org
> > > >>  > >
> > > >>  > > --
> > > >>  > >
> > > >>  > > Jon
> > > >>  > >
> > > >>  > > Sent from my mobile device
> > > >>  >
> > > >>  > -------------------
> > > >>  > Thank you,
> > > >>  >
> > > >>  > James Sirota
> > > >>  > PPMC- Apache Metron (Incubating)
> > > >>  > jsirota AT apache DOT org
> > > >>  >
> > > >>  --
> > > >>
> > > >>  Jon
> > > >>
> > > >>  Sent from my mobile device
> > >
> > > -------------------
> > > Thank you,
> > >
> > > James Sirota
> > > PPMC- Apache Metron (Incubating)
> > > jsirota AT apache DOT org
> > >
> >
> --
>
> Jon
>
> Sent from my mobile device
>
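
For readers following the architecture discussion quoted above, approach 3 (move the topic name into the error JSON as a new "error_type" field and write all errors to a single error topic) can be sketched roughly as below. The topic names and the "error_type" field come from the thread; the function name `to_single_topic` and the single topic name "error" are assumptions for illustration, and no Kafka client is involved.

```python
import json

# Error topics as listed in the thread.
ERROR_TOPICS = (
    "parser_error",
    "parser_invalid",
    "enrichments_error",
    "threatintel_error",
    "indexing_error",
)

# Hypothetical name for the single, consolidated error topic.
SINGLE_ERROR_TOPIC = "error"


def to_single_topic(source_topic, error_json):
    """Tag an error message with its former topic and reroute it.

    Returns (destination_topic, message_json). The former topic name is
    carried in the "error_type" field so dashboards can still split
    errors by source.
    """
    if source_topic not in ERROR_TOPICS:
        raise ValueError("unknown error topic: " + source_topic)
    record = json.loads(error_json)
    record["error_type"] = source_topic
    return SINGLE_ERROR_TOPIC, json.dumps(record, sort_keys=True)


topic, message = to_single_topic("parser_invalid", '{"message": "bad record"}')
```

A dedicated indexing topology would then read only from the single topic, so a spike in errors would not compete with regular message indexing.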
