metron-dev mailing list archives

From Ali Nazemian <alinazem...@gmail.com>
Subject Re: Normalization topology or separate normalization bolt for parsing topology
Date Wed, 03 May 2017 01:05:46 GMT
Hi Nick,

I am happy to continue the development using the current architecture and
embed the pre-parsing steps in the parser code. However, that would work
against the goal of contributing parsers back to the Metron community to
expand the range of supported devices. Clearly, a generic parser would be
useful for the community, not a parser that is highly customised for our
noisy environment. I was looking to decouple Parsing and Normalisation so
that I could implement a generic parser which can be used by others as well.
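
To make the idea concrete, here is a minimal Python sketch of the decoupling,
not Metron code. All names in it (drop_duplicate_host, parse_asa,
make_pipeline) are hypothetical, and the regular expressions are only my
assumptions about the ASA line layout, based on the duplicate-hostname sample
quoted further down this thread. The generic parser stays strict, and the
site-specific cleaning lives in a separate list of normalisers that runs
before it:

```python
import re
from typing import Callable, Dict, List

Normalizer = Callable[[str], str]
Parser = Callable[[str], Dict[str, str]]

# Hypothetical site-specific rule: collapse a syslog host token that is
# repeated twice in a row right after the timestamp.
_DUP_HOST = re.compile(r"^(<\d+>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\S+)\s+\2\s+")


def drop_duplicate_host(raw: str) -> str:
    """Pre-parse normaliser: 'host host %ASA...' becomes 'host %ASA...'."""
    return _DUP_HOST.sub(r"\1 \2 ", raw)


# Assumed layout of a well-formed line; deliberately strict and generic.
_ASA = re.compile(
    r"^<(?P<priority>\d+)>(?P<timestamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+"
    r"(?P<syslog_host>\S+)\s+:?\s*%(?P<tag>ASA-\d-\d+):\s+(?P<message>.*)$"
)


def parse_asa(raw: str) -> Dict[str, str]:
    """Generic parser: knows nothing about any particular site's noise."""
    match = _ASA.match(raw)
    if match is None:
        raise ValueError("unparseable message")
    return match.groupdict()


def make_pipeline(normalizers: List[Normalizer], parser: Parser) -> Parser:
    """Run every normaliser in order, then hand the result to the parser."""
    def run(raw: str) -> Dict[str, str]:
        cleaned = raw
        for normalize in normalizers:
            cleaned = normalize(cleaned)
        fields = parser(cleaned)
        fields["original_string"] = raw  # keep the untouched input
        return fields
    return run


SAMPLE = ("<166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown ICMP "
          "connection for faddr x.x.x.x/0 gaddr y.y.y.y/0 laddr k.k.k.k/0 (ryanmar)")

noisy_site_parser = make_pipeline([drop_duplicate_host], parse_asa)
```

With an empty normaliser list the pipeline is just the generic parser, so
users whose data is already clean pay essentially nothing for the extra step.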

I think this is more of a strategic decision, one which can increase the
number of generic parsers contributed back to the community in the future.
Ideally, the official Metron developers would focus on Metron features
instead of developing generic parsers.

Thanks,
Ali

On Wed, May 3, 2017 at 3:03 AM, Nick Allen <nick@nickallen.org> wrote:

> Yes, and currently that normalization step is the Parsers.
>
> I am not saying the message has to be entirely clear and well-defined.  But
> there is a minimum set of expectations that you must have of any data that
> you're ingesting.   Once it meets that "minimum set", the parser should be
> able to ingest and normalize the message.  Any oddities beyond that
> "minimum set" can be handled with Stellar either post-Parsing or in
> Enrichment.
>
> It is, of course, a judgement call as to what that minimum set is for you.
> You would just need a Parser that matches your definition of "minimum set".
>
> My main point here is that I am not seeing a need to re-architect
> anything.  I think we have the right tools, IMHO.
>
> On Tue, May 2, 2017 at 10:33 AM, Ali Nazemian <alinazemian@gmail.com>
> wrote:
>
> > Hi Nick,
> >
> > The date could be corrupted for any reason, and sometimes we have no
> > control over the device. Obviously, it is not a big deal if we lose a
> > <166> severity message, but it could be a different situation for a <161>
> > severity message or an actual critical threat. However, I mentioned those
> > defects as examples to point out the importance of having a normalisation
> > step in the Metron processing chain.
> >
> > I still think there is no guarantee of an entirely clear and well-defined
> > message in a real-world use case. If we recognise this situation as a
> > problem, then finding a high-performance and flexible solution is not
> > very hard.
> >
> > Cheers,
> > Ali
> >
> > On Tue, May 2, 2017 at 11:24 PM, Nick Allen <nick@nickallen.org> wrote:
> >
> > > Before worrying about how to ingest this 'noisy' data, I would want to
> > > better understand root cause.  If you cannot even get a valid date
> > > format, are you sure the data can be trusted?
> > >
> > > Rather than bending over backwards to try to ingest it, I would first
> > > make sure the telemetry is not totally bogus to begin with.  Maybe it is
> > > better that the data is dropped in cases like this.
> > >
> > > IMHO, that is how I would tackle a problem like this.  Not all data can
> > > be trusted.
> > >
> > > On Thu, Apr 27, 2017 at 9:54 AM, Ali Nazemian <alinazemian@gmail.com>
> > > wrote:
> > >
> > > > Are you sure? The syslog_host name is way more complicated than
> > > > something that can be a coincidence. I need to double check with one
> > > > of the security device experts, but I thought it was some kind of
> > > > noise.
> > > >
> > > > Yes, we do have more cases that seem to be corrupted, for example
> > > > duplicate IP addresses or a corrupted date format. Please have a look
> > > > at the following message. At least I am sure the date format is
> > > > corrupted in this one.
> > > >
> > > > <166>*Jan 22:42:12* hostname : %ASA-6-302013: Built inbound TCP
> > > > connection 416661250 for outside:*x.x.x.x/p1* *x.x.x.x/p1* to
> > > > inside:*y.y.y.y/p2* *y.y.y.y/p2*
> > > >
> > > > Cheers,
> > > > Ali
> > > >
> > > > On Thu, Apr 27, 2017 at 10:00 PM, Simon Elliston Ball <
> > > > simon@simonellistonball.com> wrote:
> > > >
> > > > > In that instance, you're looking at valid syslog which should be
> > > > > parsed as such. The repeat host is not really a host in syslog
> > > > > terms, it's an application name header which happens to be the
> > > > > same. This is definitely a parser bug which should be handled, esp
> > > > > since the header is perfectly RFC compliant.
> > > > >
> > > > > Do you have any other such cases? My view is that parsers should be
> > > > > written with more any case, so should extract all the fields they
> > > > > can from malformed logs, rather than throwing exceptions, but that's
> > > > > more about the way we write parsers than having some kind of
> > > > > pre-clean.
> > > > >
> > > > > Simon
> > > > >
> > > > > Sent from my iPad
> > > > >
> > > > > > On 27 Apr 2017, at 08:04, Ali Nazemian <alinazemian@gmail.com> wrote:
> > > > > >
> > > > > > I do agree there is a fair amount of overhead in using another bolt
> > > > > > for this purpose. I am not prescribing an implementation. There
> > > > > > might be a way to segregate the two extension points without adding
> > > > > > overhead; I haven't thought about it yet. However, the main issue is
> > > > > > that sometimes the noise generates an exception on the parsing
> > > > > > side. For example, have a look at the following log:
> > > > > >
> > > > > > <166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown ICMP
> > > > > > connection for faddr x.x.x.x/0 gaddr y.y.y.y/0 laddr k.k.k.k/0
> > > > > > (ryanmar)
> > > > > >
> > > > > > Clearly the duplicate syslog_host throws an exception on parsing, so
> > > > > > how are we going to deal with that in a post-parse transformation?
> > > > > > It cannot get past parsing. This is only a single example of the
> > > > > > cases that might affect production data. Unless a Stellar
> > > > > > transformation is something that can be done pre-parse, on the
> > > > > > entire message.
> > > > > >
> > > > > >
> > > > > > On Thu, Apr 27, 2017 at 11:14 AM, Simon Elliston Ball <
> > > > > > simon@simonellistonball.com> wrote:
> > > > > >
> > > > > >> Ali,
> > > > > >>
> > > > > >> Sounds very much like what you’re talking about when you say
> > > > > >> normalization, and what I would understand it as, is the process
> > > > > >> fulfilled by Stellar field transformation in the parser config.
> > > > > >> Agreed that some of these will be general, based on a common Metron
> > > > > >> standard schema, but others will be organisation specific (custom
> > > > > >> fields overloaded with different meanings in CEF, for example).
> > > > > >> These are very much one of the reasons we have the Stellar
> > > > > >> transformation step. I don’t think that should be moved to a
> > > > > >> separate bolt to be honest, because that comes with a fair amount
> > > > > >> of overhead, but logically it is in the parser config rather than
> > > > > >> the parser, so seems to serve this purpose in the post-parse
> > > > > >> transform, no?
> > > > > >>
> > > > > >> Simon
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>> On 27 Apr 2017, at 02:08, Ali Nazemian <alinazemian@gmail.com> wrote:
> > > > > >>>
> > > > > >>> Hi Simon,
> > > > > >>>
> > > > > >>> The reason I am asking for a specific normalisation step is that
> > > > > >>> normalisation is not a general use case which can be reused by
> > > > > >>> other users. It is completely bound to our application. The way we
> > > > > >>> have fixed it, for now, is to add a normalisation step to the
> > > > > >>> parser that cleans the incoming data so the parser step can work
> > > > > >>> on it, but I don't like it. There is no point in creating a parser
> > > > > >>> that can handle all of the possible noise that can exist in
> > > > > >>> production data. Even if it is possible to predict every kind of
> > > > > >>> noise in production data, there is no point in the Metron
> > > > > >>> community focusing on building a general purpose parser for a
> > > > > >>> specific device when they could spend that time on developing a
> > > > > >>> cool feature. Even if it is possible to predict the noise and it
> > > > > >>> is acceptable for the community to spend their time on creating
> > > > > >>> that kind of parser, why would every Metron user need that extra
> > > > > >>> normalisation? A user's data might be clean to begin with, in
> > > > > >>> which case it only decreases the total throughput without any
> > > > > >>> benefit for that specific user.
> > > > > >>>
> > > > > >>> Imagine there is an additional bolt for normalisation and there is
> > > > > >>> a mechanism to customise the normalisation without changing the
> > > > > >>> general parser for a specific device. We can have a general parser
> > > > > >>> as a common parser for that device and leave the normalisation
> > > > > >>> development to users. However, it is very important to provide the
> > > > > >>> normalisation step as fast as possible.
> > > > > >>>
> > > > > >>> Cheers,
> > > > > >>> Ali
> > > > > >>>
> > > > > >>> On Thu, Apr 27, 2017 at 12:05 AM, Casey Stella <cestella@gmail.com>
> > > > > >>> wrote:
> > > > > >>>
> > > > > >>>> Yeah, we definitely don't want to rewrite parsing in Stellar.  I
> > > > > >>>> would expect the job of the parser, however, to handle structural
> > > > > >>>> issues.  In my mind, parsing is about transforming structures
> > > > > >>>> into fields and the role of the field transformations is to
> > > > > >>>> transform values.  There's obvious overlap there wherein parsers
> > > > > >>>> may do some normalizations/transformations (e.g. look at how grok
> > > > > >>>> handles timestamps), but it almost always gets us into trouble
> > > > > >>>> when parsers do even moderately complex value transformations.
> > > > > >>>>
> > > > > >>>> As I type this, though, I think I see your point.  What you
> > > > > >>>> really want is to chain parsers: have a pre-parser bring you 80%
> > > > > >>>> of the way there and hammer out all the structural issues so you
> > > > > >>>> might be able to use a more generic parser down the chain.  I
> > > > > >>>> have often thought that maybe we should expose parsers as Stellar
> > > > > >>>> functions which take raw data and emit whole messages.  This
> > > > > >>>> would allow us to compose parsers.  Imagine the above example,
> > > > > >>>> where you've written a Stellar function to normalize the input
> > > > > >>>> and you're then passing it to a CSV parser: you could run
> > > > > >>>> "CSV_PARSE(ALI_NORMALIZE(message))" where you'd otherwise
> > > > > >>>> specify a parser.
> > > > > >>>>
> > > > > >>>> As for speed, the Stellar expression would get compiled into a
> > > > > >>>> Java object, so it shouldn't be appreciable overhead since we no
> > > > > >>>> longer lex and parse for every message.
> > > > > >>>>
> > > > > >>>> Is this kinda how you were seeing it?
> > > > > >>>>
> > > > > >>>> On Wed, Apr 26, 2017 at 9:51 AM, Simon Elliston Ball <
> > > > > >>>> simon@simonellistonball.com> wrote:
> > > > > >>>>
> > > > > >>>>> The challenge there I suspect is going to be that you
> > > > > >>>>> essentially end up with the actual parser doing very little of
> > > > > >>>>> value, and then effectively trying to write a parser in Stellar
> > > > > >>>>> against a few broad strings, which would likely give you all
> > > > > >>>>> sorts of performance problems.
> > > > > >>>>>
> > > > > >>>>> One solution is to write a very defensive and flexible parser,
> > > > > >>>>> but that would tend to be time consuming.
> > > > > >>>>>
> > > > > >>>>> There is also something to be said for doing some basic
> > > > > >>>>> transformation before the parser's Kafka topic in something
> > > > > >>>>> like NiFi, but again, performance can be an issue there.
> > > > > >>>>>
> > > > > >>>>> If the noise is about broken structure for example, maybe a
> > > > > >>>>> simple pre-process step as part of your parser would make sense,
> > > > > >>>>> e.g. stripping syslog headers, or character set conversion,
> > > > > >>>>> removing very broken bits as part of the parse method.
> > > > > >>>>>
> > > > > >>>>> In terms of normalisation post-parse, I agree, that's 100% a job
> > > > > >>>>> for Stellar, and the fieldTransformations capability. Something
> > > > > >>>>> I would like to see would be a means to use that transformation
> > > > > >>>>> step to map to a well known (though loosely enforced) schema
> > > > > >>>>> provided by a governance framework, but that is a much bigger
> > > > > >>>>> topic of conversation.
> > > > > >>>>>
> > > > > >>>>> Note of course that not everything has to be parsed just because
> > > > > >>>>> it’s in the message. A relatively loose fitting parser which
> > > > > >>>>> pulls out the relevant data for the use case would be fine, and
> > > > > >>>>> likely a lot more tolerant of noise than something that felt the
> > > > > >>>>> need for every field. We do after all store the original_string
> > > > > >>>>> for you if you really absolutely have to have everything, so a
> > > > > >>>>> more schema-on-read philosophy certainly applies and will likely
> > > > > >>>>> side-step a lot of your issues.
> > > > > >>>>>
> > > > > >>>>> Simon
> > > > > >>>>>
> > > > > >>>>>> On 26 Apr 2017, at 14:37, Casey Stella <cestella@gmail.com> wrote:
> > > > > >>>>>>
> > > > > >>>>>> Ok, that's another story.  Hmmmm, we don't generally pre-parse
> > > > > >>>>>> because we try not to assume any particular format there (i.e.
> > > > > >>>>>> it could be strings, could be byte arrays).  Maybe the right
> > > > > >>>>>> answer is to pass the raw, non-normalized data (best effort type
> > > > > >>>>>> of thing) through the parser and do the normalization
> > > > > >>>>>> post-parse... or is there a problem with that?
> > > > > >>>>>>
> > > > > >>>>>> On Wed, Apr 26, 2017 at 9:33 AM, Ali Nazemian
> > > > > >>>>>> <alinazemian@gmail.com> wrote:
> > > > > >>>>>>
> > > > > >>>>>>> Hi Casey,
> > > > > >>>>>>>
> > > > > >>>>>>> It is actually a pre-parse process, not a post-parse one. This
> > > > > >>>>>>> type of noise affects the position of an attribute, for
> > > > > >>>>>>> example, and gives us a parsing exception. The timestamp
> > > > > >>>>>>> example was not a good one because that is actually a
> > > > > >>>>>>> post-parse exception.
> > > > > >>>>>>>
> > > > > >>>>>>> On Wed, Apr 26, 2017 at 11:28 PM, Casey Stella
> > > > > >>>>>>> <cestella@gmail.com> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>>> So, further transformation post-parse was one of the
> > > > > >>>>>>>> motivating reasons for Stellar (to do that transformation
> > > > > >>>>>>>> post-parse).  Is there a capability that it's lacking that we
> > > > > >>>>>>>> can add to fit your usecase?
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Wed, Apr 26, 2017 at 9:24 AM, Ali Nazemian
> > > > > >>>>>>>> <alinazemian@gmail.com> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>>> I've created a Jira ticket regarding this feature.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> https://issues.apache.org/jira/browse/METRON-893
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> On Wed, Apr 26, 2017 at 11:11 PM, Ali Nazemian
> > > > > >>>>>>>>> <alinazemian@gmail.com> wrote:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>> Currently, we are using plain regex in the Java source
> > > > > >>>>>>>>>> code to handle those situations. However, it would be nice
> > > > > >>>>>>>>>> to have a separate bolt and deal with them separately.
> > > > > >>>>>>>>>> Yeah, I can create a Jira issue regarding that. The main
> > > > > >>>>>>>>>> reason I am asking for such a feature is that the lack of
> > > > > >>>>>>>>>> it makes the process of creating a parser for the
> > > > > >>>>>>>>>> community a little painful for us. We need to maintain two
> > > > > >>>>>>>>>> different versions, one for the community and another for
> > > > > >>>>>>>>>> the internal use case. Clearly, noise is an inevitable
> > > > > >>>>>>>>>> part of real-world use cases.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Cheers,
> > > > > >>>>>>>>>> Ali
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler
> > > > > >>>>>>>>>> <ottobackwards@gmail.com> wrote:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>> Hi,
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Are you doing this cleansing all in the parser or are you
> > > > > >>>>>>>>>>> using any Stellar to do it?
> > > > > >>>>>>>>>>> Can you create a jira?
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> On April 26, 2017 at 08:59:16, Ali Nazemian
> > > > > >>>>>>>>>>> (alinazemian@gmail.com) wrote:
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Hi all,
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> We are facing certain use cases in Metron production that
> > > > > >>>>>>>>>>> are related to noisy streams: for example, a wrong
> > > > > >>>>>>>>>>> timestamp, a duplicate hostname/IP address, etc. To deal
> > > > > >>>>>>>>>>> with the normalization we have added an additional step
> > > > > >>>>>>>>>>> to the corresponding parsers to do the data cleaning.
> > > > > >>>>>>>>>>> Clearly, parsing is a standard step which mostly depends
> > > > > >>>>>>>>>>> on the device that is generating the data and can be
> > > > > >>>>>>>>>>> reused for the same type of device everywhere, but
> > > > > >>>>>>>>>>> normalization is very production dependent and there is
> > > > > >>>>>>>>>>> no point in mixing normalization with parsing. It would
> > > > > >>>>>>>>>>> be nice to have a separate bolt in the parsing topologies
> > > > > >>>>>>>>>>> dedicated to the production-related cleaning process. In
> > > > > >>>>>>>>>>> that case, everybody can easily contribute additional
> > > > > >>>>>>>>>>> parsers to the Metron community without being worried
> > > > > >>>>>>>>>>> about mixing parsers and the data cleaning process.
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Regards,
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Ali
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> --
> > > > > >>>>>>>>>> A.Nazemian
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> --
> > > > > >>>>>>>>> A.Nazemian
> > > > > >>>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> --
> > > > > >>>>>>> A.Nazemian
> > > > > >>>>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> --
> > > > > >>> A.Nazemian
> > > > > >>
> > > > > >>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > A.Nazemian
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > A.Nazemian
> > > >
> > >
> >
> >
> >
> > --
> > A.Nazemian
> >
>



-- 
A.Nazemian
