metron-dev mailing list archives

From "Zeolla@GMail.com" <zeo...@gmail.com>
Subject Re: [DISCUSS] Error Indexing
Date Thu, 26 Jan 2017 03:37:19 GMT
Although hashing the whole message is better than nothing, it misses a lot
of the benefits we could get.

While I'd love to have consistency for this field across all of the
different error.types, it appears that may not be reasonably possible
because of the parsers.  So, how about hashing all of the constant fields
<https://github.com/apache/incubator-metron/blob/master/metron-platform/metron-common/src/main/java/org/apache/metron/common/Constants.java>
except timestamp and original_string, unless the error comes from a parser,
in which case we hash the entire message?  This gives us some measure of
event uniqueness, and it can grow as we define additional constant fields
(I recall discussing with someone else on the list about expanding those
standard fields to include things like usernames, but I can't find the
specific email exchange).
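
As a rough sketch of that proposal (illustrative only: the field list below
is an assumed subset rather than a copy of Constants.java, the function name
is hypothetical, and SHA3-256 is just one possible choice of hash):

```python
import hashlib
import json

# Assumed subset of the standard fields; the authoritative list is Constants.java.
CONSTANT_FIELDS = ("ip_src_addr", "ip_dst_addr", "ip_src_port",
                   "ip_dst_port", "protocol", "timestamp", "original_string")
# Fields left out of the hash because they differ for every event.
EXCLUDED_FIELDS = {"timestamp", "original_string"}


def error_hash(error_type, raw_message, parsed_fields=None):
    """Hash the constant fields, or the whole raw message for parser errors."""
    subset = {}
    if error_type != "parser_error" and parsed_fields:
        subset = {name: parsed_fields[name]
                  for name in CONSTANT_FIELDS
                  if name in parsed_fields and name not in EXCLUDED_FIELDS}
    if subset:
        # Sort keys so the same field values always serialize identically.
        data = json.dumps(subset, sort_keys=True)
    else:
        # Parser errors (or messages with no constant fields): hash everything.
        data = raw_message
    return hashlib.sha3_256(data.encode("utf-8")).hexdigest()
```

With this, two enrichment errors on the same source IP produce the same hash
even when their timestamps differ, while parser errors only match when the
raw message is identical.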

Because some enrichments can be heavily relied on, I think it makes sense
to put a message onto the error queue when it throws an exception.  Not
only does this help troubleshoot edge cases, but it makes issues more
obvious when assembling a new enrichment in dev/test.  I can't currently
think of a scenario where an enrichment would be only "best effort" and I
wouldn't want its errors indexed and retrievable.  However, this gets
interesting when talking about the various options to solve the "Enrich
enrichment" discussion from earlier in the month.  We can keep that part of
this separate though, as I don't think that's being actively pursued right
now.
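
To illustrate the enrichment case (a sketch only: enrich_or_report, the
in-memory error_queue list, and the ".enriched" key are hypothetical
stand-ins, not Metron APIs; the error fields follow the schema Ryan posted
below, with the hash taken over the failing field value rather than the
whole message):

```python
import hashlib
import time
import traceback


def enrich_or_report(message, field, enrich, error_queue):
    """Apply an enrichment; on failure, queue an error instead of only logging."""
    try:
        message[field + ".enriched"] = enrich(message[field])
    except Exception as exc:
        # Hash the value that failed enrichment so repeats can be grouped.
        value = str(message.get(field, ""))
        error_queue.append({
            "exception": repr(exc),
            "message": str(exc),
            "stack": traceback.format_exc(),
            "time": int(time.time() * 1000),  # epoch milliseconds
            "error.type": "enrichments_error",
            "rawMessage_hash": hashlib.sha3_256(value.encode("utf-8")).hexdigest(),
        })
    # The original message continues downstream either way.
    return message
```

The message still flows on unchanged when the enrichment fails, but the
failure is now something that can be indexed and searched.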

Jon

On Wed, Jan 25, 2017 at 10:49 AM David Lyle <dlyle65535@gmail.com> wrote:

RE: separate JIRA for MPack/Ansible. No objection to tracking them
separately, but for this item to be complete, you'll need both the feature
and the ability to install it.

-D...


On Tue, Jan 24, 2017 at 5:33 PM, Ryan Merriman <merrimanr@gmail.com> wrote:

> Assuming we're going to write all errors to a single error topic, I think
> it makes sense to agree on an error message schema and handle errors across
> the 3 different topologies in the same way with a single implementation.
> The implementation in ParserBolt (ErrorUtils.handleError) produces the most
> verbose error object so I think it's a good candidate for the single
> implementation.  Here is the message structure it currently produces:
>
> {
>   "exception": "java.lang.Exception: there was an error",
>   "hostname": "host",
>   "stack": "java.lang.Exception: ...",
>   "time": 1485295416563,
>   "message": "there was an error",
>   "rawMessage": "raw message",
>   "rawMessage_bytes": [],
>   "source.type": "bro_error"
> }
>
> From our discussion so far we need to add a couple fields:  an error type
> and hash id.  Adding these to the message looks like:
>
> {
>   "exception": "java.lang.Exception: there was an error",
>   "hostname": "host",
>   "stack": "java.lang.Exception: ...",
>   "time": 1485295416563,
>   "message": "there was an error",
>   "rawMessage": "raw message",
>   "rawMessage_bytes": [],
>   "source.type": "bro_error",
>   "error.type": "parser_error",
>   "rawMessage_hash": "dde41b9920954f94066daf6291fb58a9"
> }
>
> We should also consider expanding the error types I listed earlier.
> Instead of just having "indexing_error" we could have
> "elasticsearch_indexing_error", "hdfs_indexing_error" and so on.
>
> Jon, if an exception happens in an enrichment or threat intel bolt the
> message is passed along with no error thrown (only logged).  Everywhere
> else I'm having trouble identifying specific fields that should be hashed.
> Would hashing the message in every case be acceptable?  Do you know of a
> place where we could hash a field instead?  On the topic of exceptions in
> enrichments, are we ok with an error only being logged and not added to the
> message or emitted to the error queue?
>
>
>
> On Tue, Jan 24, 2017 at 3:10 PM, Ryan Merriman <merrimanr@gmail.com>
> wrote:
>
> > That use case makes sense to me.  I don't think it will require that much
> > additional effort either.
> >
> > On Tue, Jan 24, 2017 at 1:02 PM, Zeolla@GMail.com <zeolla@gmail.com>
> > wrote:
> >
> >> Regarding error vs validation - Either way I'm not very concerned.  I
> >> initially assumed they would be combined and agree with that approach, but
> >> splitting them out isn't a very big deal to me either.
> >>
> >> Re: Ryan.  Yes, exactly.  In the case of a parser issue (or anywhere else
> >> where it's not possible to pick out the exact thing causing the issue) it
> >> would be a hash of the complete message.
> >>
> >> Regarding the architecture, I mostly agree with James except that I think
> >> step 3 needs to also be able to somehow group errors via the original
> >> data (identify replays, identify repeat issues with data in a specific
> >> field, issues with consistently different data, etc.).  This is
> >> essentially the first step of troubleshooting, which I assume you are
> >> doing if you're looking at the error dashboard.
> >>
> >> If the hash gets moved out of the initial implementation, I'm fairly
> >> certain you lose this ability.  The point here isn't to handle long fields
> >> (although that's a benefit of this approach), it's to attach a unique
> >> identifier to the error/validation issue message that links it to the
> >> original problem.  I'd be happy to consider alternative solutions to this
> >> problem (for instance, actually sending across the data itself) I just
> >> haven't been able to think of another way to do this that I like better.
> >>
> >> Jon
> >>
> >> On Tue, Jan 24, 2017 at 1:13 PM Ryan Merriman <merrimanr@gmail.com>
> >> wrote:
> >>
> >> > We also need a JIRA for any install/Ansible/MPack work needed.
> >> >
> >> > On Tue, Jan 24, 2017 at 12:06 PM, James Sirota <jsirota@apache.org>
> >> wrote:
> >> >
> >> > > Now that I had some time to think about it I would collapse all error and
> >> > > validation topics into one.  We can differentiate between different views
> >> > > of the data (split by error source etc) via Kibana dashboards.  I would
> >> > > implement this feature incrementally.  First I would modify all the bolts
> >> > > to log to a single topic.  Second, I would get the error indexing done by
> >> > > attaching the indexing topology to the error topic. Third I would create
> >> > > the necessary dashboards to view errors and validation failures by source.
> >> > > Lastly, I would file a follow-on JIRA to introduce hashing of errors or
> >> > > fields that are too long.  It seems like a separate feature that we need to
> >> > > think through.  We may need a stellar function around that.
> >> > >
> >> > > Thanks,
> >> > > James
> >> > >
> >> > > 24.01.2017, 10:25, "Ryan Merriman" <merrimanr@gmail.com>:
> >> > > > I understand what Jon is talking about. He's proposing we hash the value
> >> > > > that caused the error, not necessarily the error message itself. For an
> >> > > > enrichment this is easy. Just pass along the field value that failed
> >> > > > enrichment. For other cases the field that caused the error may not be so
> >> > > > obvious. Take parser validation for example. The message is validated as
> >> > > > a whole and it may not be easy to determine which field is the cause. In
> >> > > > that case would a hash of the whole message work?
> >> > > >
> >> > > > There is a broader architectural discussion that needs to happen before we
> >> > > > can implement this. Currently we have an indexing topology that reads from
> >> > > > 1 topic and writes messages to ES but errors are written to several
> >> > > > different topics:
> >> > > >
> >> > > >    - parser_error
> >> > > >    - parser_invalid
> >> > > >    - enrichments_error
> >> > > >    - threatintel_error
> >> > > >    - indexing_error
> >> > > >
> >> > > > I can see 4 possible approaches to implementing this:
> >> > > >
> >> > > >    1. Create an index topology for each error topic
> >> > > >       1. Good because we can easily reuse the indexing topology and would
> >> > > >       require the least development effort
> >> > > >       2. Bad because it would consume a lot of extra worker slots
> >> > > >    2. Move the topic name into the error JSON message as a new "error_type"
> >> > > >    field and write all messages to the indexing topic
> >> > > >       1. Good because we don't need to create a new topology
> >> > > >       2. Bad because we would be flowing data and errors through the same
> >> > > >       topology. A spike in errors could affect message indexing.
> >> > > >    3. Compromise between 1 and 2. Create another indexing topology that is
> >> > > >    dedicated to indexing errors. Move the topic name into the error JSON
> >> > > >    message as a new "error_type" field and write all errors to a single
> >> > > >    error topic.
> >> > > >    4. Write a completely new topology with multiple spouts (1 for each
> >> > > >    error type listed above) that all feed into a single
> >> > > >    BulkMessageWriterBolt.
> >> > > >       1. Good because the current topologies would not need to change
> >> > > >       2. Bad because it would require the most development effort, would
> >> > > >       not reuse existing topologies and takes up more worker slots than 3
> >> > > > Are there other approaches I haven't thought of? I think 1 and 2 are off
> >> > > > the table because they are shortcuts and not good long-term solutions. 3
> >> > > > would be my choice because it introduces less complexity than 4. Thoughts?
> >> > > >
> >> > > > Ryan
> >> > > >
> >> > > > On Mon, Jan 23, 2017 at 5:44 PM, Zeolla@GMail.com <zeolla@gmail.com> wrote:
> >> > > >
> >> > > >>  In that case the hash would be of the value in the IP field, such as
> >> > > >>  sha3(8.8.8.8).
> >> > > >>
> >> > > >>  Jon
> >> > > >>
> >> > > >>  On Mon, Jan 23, 2017, 6:41 PM James Sirota <jsirota@apache.org> wrote:
> >> > > >>  > Jon,
> >> > > >>  >
> >> > > >>  > I am still not entirely following why we would want to use hashing. For
> >> > > >>  > example if my error is "Your IP field is invalid and failed validation"
> >> > > >>  > hashing this error string will always result in the same hash. Why not
> >> > > >>  > just use the actual error string? Can you provide an example where you
> >> > > >>  > would use it?
> >> > > >>  >
> >> > > >>  > Thanks,
> >> > > >>  > James
> >> > > >>  >
> >> > > >>  > 23.01.2017, 16:29, "Zeolla@GMail.com" <zeolla@gmail.com>:
> >> > > >>  > > For 1 - I'm good with that.
> >> > > >>  > >
> >> > > >>  > > I'm talking about hashing the relevant content itself not the error. Some
> >> > > >>  > > benefits are (1) minimize load on search index (there's minimal benefit in
> >> > > >>  > > spending the CPU and disk to keep it at full fidelity (tokenize and store))
> >> > > >>  > > (2) provide something to key on for dashboards (assuming a good hash
> >> > > >>  > > algorithm that avoids collisions and is second preimage resistant) and (3)
> >> > > >>  > > specific to errors, if the issue is that it failed to index, a hash gives
> >> > > >>  > > us some protection that the issue will not occur twice.
> >> > > >>  > >
> >> > > >>  > > Jon
> >> > > >>  > >
> >> > > >>  > > On Mon, Jan 23, 2017, 2:47 PM James Sirota <jsirota@apache.org> wrote:
> >> > > >>  > >
> >> > > >>  > > Jon,
> >> > > >>  > >
> >> > > >>  > > With regards to 1, collapsing to a single dashboard for each would be
> >> > > >>  > > fine. So we would have one error index and one "failed to validate"
> >> > > >>  > > index. The distinction is that errors would be things that went wrong
> >> > > >>  > > during stream processing (failed to parse, etc...), while validation
> >> > > >>  > > failures are messages that explicitly failed stellar validation/schema
> >> > > >>  > > enforcement. There should be relatively few of the second type.
> >> > > >>  > >
> >> > > >>  > > With respect to 3, why do you want the error hashed? Why not just search
> >> > > >>  > > for the error text?
> >> > > >>  > >
> >> > > >>  > > Thanks,
> >> > > >>  > > James
> >> > > >>  > >
> >> > > >>  > > 20.01.2017, 14:01, "Zeolla@GMail.com" <zeolla@gmail.com>:
> >> > > >>  > >> As someone who currently fills the platform engineer role, I can give this
> >> > > >>  > >> idea a huge +1. My thoughts:
> >> > > >>  > >>
> >> > > >>  > >> 1. I think it depends on exactly what data is pushed into the index (#3).
> >> > > >>  > >> However, assuming the errors you proposed recording, I can't see huge
> >> > > >>  > >> benefits to having more than one dashboard. I would be happy to be
> >> > > >>  > >> persuaded otherwise.
> >> > > >>  > >>
> >> > > >>  > >> 2. I would say yes, storing the errors in HDFS in addition to indexing is
> >> > > >>  > >> a good thing. Using METRON-510
> >> > > >>  > >> <https://issues.apache.org/jira/browse/METRON-510> as a case study, there
> >> > > >>  > >> is the potential in this environment for attacker-controlled data to result
> >> > > >>  > >> in processing errors which could be a method of evading security
> >> > > >>  > >> monitoring. Once an attack is identified, the long term HDFS storage would
> >> > > >>  > >> allow better historical analysis for low-and-slow/persistent attacks (I'm
> >> > > >>  > >> thinking of a method of data exfil that also won't successfully get stored
> >> > > >>  > >> in Lucene, but is hard to identify over a short period of time).
> >> > > >>  > >> - Along this line, I think that there are various parts of Metron (this
> >> > > >>  > >> included) which could benefit from having a method of configuring data aging
> >> > > >>  > >> by bucket in HDFS (Following Nick's comments here
> >> > > >>  > >> <https://issues.apache.org/jira/browse/METRON-477>).
> >> > > >>  > >>
> >> > > >>  > >> 3. I would potentially add a hash of the content that failed validation to
> >> > > >>  > >> help identify repeats over time with less of a concern that you'd have back
> >> > > >>  > >> to back failures (i.e. instead of storing the value itself). Additionally,
> >> > > >>  > >> I think it's helpful to be able to search all times there was an indexing
> >> > > >>  > >> error (instead of it hitting the catch-all).
> >> > > >>  > >> Jon
> >> > > >>  > >>
> >> > > >>  > >> On Fri, Jan 20, 2017 at 1:17 PM James Sirota <jsirota@apache.org> wrote:
> >> > > >>  > >>
> >> > > >>  > >> We already have a capability to capture bolt errors and validation errors
> >> > > >>  > >> and pipe them into a Kafka topic. I want to propose that we attach a
> >> > > >>  > >> writer topology to the error and validation failed kafka topics so that we
> >> > > >>  > >> can (a) create a new ES index for these errors and (b) create a new Kibana
> >> > > >>  > >> dashboard to visualize them. The benefit would be that errors and
> >> > > >>  > >> validation failures would be easier to see and analyze.
> >> > > >>  > >>
> >> > > >>  > >> I am seeking feedback on the following:
> >> > > >>  > >>
> >> > > >>  > >> - How granular would we want this feature to be? Think we would want one
> >> > > >>  > >> index/dashboard per source? Or would it be better to collapse everything
> >> > > >>  > >> into the same index?
> >> > > >>  > >> - Do we care about storing these errors in HDFS as well? Or is indexing
> >> > > >>  > >> them enough?
> >> > > >>  > >> - What types of errors should we record? I am proposing:
> >> > > >>  > >>
> >> > > >>  > >> For error reporting:
> >> > > >>  > >> --Message failed to parse
> >> > > >>  > >> --Enrichment failed to enrich
> >> > > >>  > >> --Threat intel feed failures
> >> > > >>  > >> --Generic catch-all for all other errors
> >> > > >>  > >>
> >> > > >>  > >> For validation reporting:
> >> > > >>  > >> --What part of message failed validation
> >> > > >>  > >> --What stellar validator caused the failure
> >> > > >>  > >>
> >> > > >>  > >> -------------------
> >> > > >>  > >> Thank you,
> >> > > >>  > >>
> >> > > >>  > >> James Sirota
> >> > > >>  > >> PPMC- Apache Metron (Incubating)
> >> > > >>  > >> jsirota AT apache DOT org
> >> > > >>  > >>
> >> > > >>  > >> --
> >> > > >>  > >>
> >> > > >>  > >> Jon
> >> > > >>  > >>
> >> > > >>  > >> Sent from my mobile device

-- 

Jon

Sent from my mobile device
