metron-dev mailing list archives

From Ryan Merriman <merrim...@gmail.com>
Subject Re: [DISCUSS] Error Indexing
Date Tue, 24 Jan 2017 17:24:58 GMT
I understand what Jon is talking about.  He's proposing we hash the value
that caused the error, not necessarily the error message itself.  For an
enrichment this is easy.  Just pass along the field value that failed
enrichment.  For other cases the field that caused the error may not be so
obvious.  Take parser validation for example.  The message is validated as
a whole and it may not be easy to determine which field is the cause.  In
that case would a hash of the whole message work?
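A minimal sketch of the hashing described above, to make the two cases concrete. It assumes SHA3-256 as the concrete "sha3" variant and hashes a canonical (sorted-key) JSON serialization when no single failing field is known. `error_hash` is a hypothetical helper, and the field names `ip_src_addr`, `source.type` and `original_string` are used for illustration only. This is not an existing Metron API.

```python
import hashlib
import json
from collections import Counter
from typing import Any, Mapping, Optional


def error_hash(message: Mapping[str, Any], failed_field: Optional[str] = None) -> str:
    """Hypothetical helper: SHA3-256 hex digest of the content that caused an error.

    If the failing field is known (e.g. a failed enrichment lookup), only that
    field's value is hashed, so sha3("8.8.8.8") is the same key wherever that
    value fails. Otherwise (e.g. whole-message parser validation), a canonical
    JSON serialization of the entire message is hashed instead.
    """
    if failed_field is not None and failed_field in message:
        payload = str(message[failed_field])
    else:
        # Sorted keys and fixed separators make the serialization deterministic,
        # so identical messages always produce identical digests.
        payload = json.dumps(message, sort_keys=True, separators=(",", ":"))
    return hashlib.sha3_256(payload.encode("utf-8")).hexdigest()


# Usage: count repeats per hash, e.g. as the key for a dashboard panel.
failures = [
    ({"ip_src_addr": "8.8.8.8", "source.type": "bro"}, "ip_src_addr"),
    ({"ip_src_addr": "8.8.8.8", "source.type": "snort"}, "ip_src_addr"),
    ({"original_string": "not a valid record"}, None),  # whole-message failure
]
repeats = Counter(error_hash(msg, field) for msg, field in failures)
# The two 8.8.8.8 enrichment failures share one key, so that key's count is 2.
print(repeats.most_common(1))
```

The whole-message fallback covers the parser validation case. The trade-off is that any difference elsewhere in the message, such as a timestamp, yields a different key.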

There is a broader architectural discussion that needs to happen before we
can implement this.  Currently we have an indexing topology that reads from
1 topic and writes messages to ES but errors are written to several
different topics:

   - parser_error
   - parser_invalid
   - enrichments_error
   - threatintel_error
   - indexing_error

I can see 4 possible approaches to implementing this:

   1. Create an index topology for each error topic
      1. Good because we can easily reuse the indexing topology and would
      require the least development effort
      2. Bad because it would consume a lot of extra worker slots
   2. Move the topic name into the error JSON message as a new "error_type"
   field and write all messages to the indexing topic
      1. Good because we don't need to create a new topology
      2. Bad because we would be flowing data and errors through the same
      topology.  A spike in errors could affect message indexing.
   3. Compromise between 1 and 2.  Create another indexing topology that is
   dedicated to indexing errors.  Move the topic name into the error JSON
   message as a new "error_type" field and write all errors to a single error
   topic.
   4. Write a completely new topology with multiple spouts (1 for each
   error type listed above) that all feed into a single BulkMessageWriterBolt.
      1. Good because the current topologies would not need to change
      2. Bad because it would require the most development effort, would
      not reuse existing topologies and takes up more worker slots than 3
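A minimal sketch of approach 3. It assumes the existing error producers can be changed to emit one JSON shape in which the former topic name becomes an `error_type` field, with everything written to a single topic. The topic name `indexing_errors` and the helpers `build_error_message` and `counts_by_type` are hypothetical. Only the five `error_type` values come from the list above.

```python
import json
from collections import Counter
from typing import Any, Dict, Iterable, Tuple

# Hypothetical name for the single topic the dedicated error-indexing topology reads.
ERROR_TOPIC = "indexing_errors"

# The current per-type error topics; each name becomes an "error_type" value.
ERROR_TYPES = frozenset({
    "parser_error",
    "parser_invalid",
    "enrichments_error",
    "threatintel_error",
    "indexing_error",
})


def build_error_message(error_type: str, error: Dict[str, Any]) -> Tuple[str, str]:
    """Return (topic, JSON payload) for an error under approach 3."""
    if error_type not in ERROR_TYPES:
        raise ValueError(f"unknown error_type: {error_type!r}")
    record = dict(error)  # copy so the caller's dict is not mutated
    record["error_type"] = error_type
    return ERROR_TOPIC, json.dumps(record, sort_keys=True)


def counts_by_type(payloads: Iterable[str]) -> Counter:
    """Downstream, a single topology can still split or count errors by type."""
    return Counter(json.loads(p)["error_type"] for p in payloads)


# Usage: two producers that previously wrote to different topics.
topic_a, payload_a = build_error_message("parser_invalid", {"message": "failed validation"})
topic_b, payload_b = build_error_message("enrichments_error", {"message": "lookup failed"})
assert topic_a == topic_b == ERROR_TOPIC
print(counts_by_type([payload_a, payload_b]))
```

Because `error_type` travels with each record, the choice between one error index and one index per type becomes a setting in the single error topology, not a question of how many topologies to run.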

Are there other approaches I haven't thought of?  I think 1 and 2 are off
the table because they are shortcuts and not good long-term solutions.  3
would be my choice because it introduces less complexity than 4.  Thoughts?

Ryan


On Mon, Jan 23, 2017 at 5:44 PM, Zeolla@GMail.com <zeolla@gmail.com> wrote:

> In that case the hash would be of the value in the IP field, such as
> sha3(8.8.8.8).
>
> Jon
>
> On Mon, Jan 23, 2017, 6:41 PM James Sirota <jsirota@apache.org> wrote:
>
> > Jon,
> >
> > I am still not entirely following why we would want to use hashing.  For
> > example if my error is "Your IP field is invalid and failed validation"
> > hashing this error string will always result in the same hash.  Why not
> > just use the actual error string? Can you provide an example where you
> > would use it?
> >
> > Thanks,
> > James
> >
> > 23.01.2017, 16:29, "Zeolla@GMail.com" <zeolla@gmail.com>:
> > > For 1 - I'm good with that.
> > >
> > > I'm talking about hashing the relevant content itself, not the error.
> > > Some benefits are (1) it minimizes load on the search index (there's
> > > minimal benefit in spending the CPU and disk to keep it at full fidelity,
> > > i.e. tokenized and stored), (2) it provides something to key on for
> > > dashboards (assuming a good hash algorithm that avoids collisions and is
> > > second-preimage resistant), and (3) specific to errors, if the issue is
> > > that the content failed to index, indexing a hash instead gives us some
> > > protection that the same failure will not occur a second time.
> > >
> > > Jon
> > >
> > > On Mon, Jan 23, 2017, 2:47 PM James Sirota <jsirota@apache.org> wrote:
> > >
> > > Jon,
> > >
> > > With regard to 1, collapsing to a single dashboard for each would be
> > > fine. So we would have one error index and one "failed to validate"
> > > index. The distinction is that errors would be things that went wrong
> > > during stream processing (failed to parse, etc...), while validation
> > > failures are messages that explicitly failed stellar validation/schema
> > > enforcement. There should be relatively few of the second type.
> > >
> > > With respect to 3, why do you want the error hashed? Why not just search
> > > for the error text?
> > >
> > > Thanks,
> > > James
> > >
> > > 20.01.2017, 14:01, "Zeolla@GMail.com" <zeolla@gmail.com>:
> > >>  As someone who currently fills the platform engineer role, I can give
> > >>  this idea a huge +1. My thoughts:
> > >>
> > >>  1. I think it depends on exactly what data is pushed into the index
> > >>  (#3). However, assuming the errors you proposed recording, I can't see
> > >>  huge benefits to having more than one dashboard. I would be happy to be
> > >>  persuaded otherwise.
> > >>
> > >>  2. I would say yes, storing the errors in HDFS in addition to indexing
> > >>  them is a good thing. Using METRON-510
> > >>  <https://issues.apache.org/jira/browse/METRON-510> as a case study,
> > >>  there is the potential in this environment for attacker-controlled data
> > >>  to result in processing errors, which could be a method of evading
> > >>  security monitoring. Once an attack is identified, the long-term HDFS
> > >>  storage would allow better historical analysis of low-and-slow or
> > >>  persistent attacks (I'm thinking of a method of data exfil that also
> > >>  won't successfully get stored in Lucene, but is hard to identify over a
> > >>  short period of time).
> > >>   - Along this line, I think that there are various parts of Metron
> > >>  (this included) which could benefit from having a method of configuring
> > >>  data aging by bucket in HDFS (following Nick's comments here
> > >>  <https://issues.apache.org/jira/browse/METRON-477>).
> > >>
> > >>  3. I would potentially add a hash of the content that failed validation
> > >>  to help identify repeats over time, with less concern that you'd have
> > >>  back-to-back failures (i.e. instead of storing the value itself).
> > >>  Additionally, I think it's helpful to be able to search all times there
> > >>  was an indexing error (instead of it hitting the catch-all).
> > >>
> > >>  Jon
> > >>
> > >>  On Fri, Jan 20, 2017 at 1:17 PM James Sirota <jsirota@apache.org> wrote:
> > >>
> > >>  We already have a capability to capture bolt errors and validation
> > >>  errors and pipe them into a Kafka topic. I want to propose that we
> > >>  attach a writer topology to the error and validation-failed Kafka topics
> > >>  so that we can (a) create a new ES index for these errors and (b) create
> > >>  a new Kibana dashboard to visualize them. The benefit would be that
> > >>  errors and validation failures would be easier to see and analyze.
> > >>
> > >>  I am seeking feedback on the following:
> > >>
> > >>  - How granular would we want this feature to be? Do we think we would
> > >>  want one index/dashboard per source? Or would it be better to collapse
> > >>  everything into the same index?
> > >>  - Do we care about storing these errors in HDFS as well? Or is indexing
> > >>  them enough?
> > >>  - What types of errors should we record? I am proposing:
> > >>
> > >>  For error reporting:
> > >>  --Message failed to parse
> > >>  --Enrichment failed to enrich
> > >>  --Threat intel feed failures
> > >>  --Generic catch-all for all other errors
> > >>
> > >>  For validation reporting:
> > >>  --What part of message failed validation
> > >>  --What stellar validator caused the failure
> > >>
> > >>  -------------------
> > >>  Thank you,
> > >>
> > >>  James Sirota
> > >>  PPMC- Apache Metron (Incubating)
> > >>  jsirota AT apache DOT org
> > >>
> > >>  --
> > >>
> > >>  Jon
> > >>
> > >>  Sent from my mobile device
