metron-dev mailing list archives

From Ryan Merriman <merrim...@gmail.com>
Subject Re: [DISCUSS] Opinionated Data Flows
Date Mon, 10 Oct 2016 16:49:10 GMT
I think this is a great discussion.  I especially like the DSL examples
that are given and think we should expand on that.  The good news is that
we are not far away from being able to actually implement it.  It's just a
matter of transforming that syntax into the ZooKeeper configs that drive
the topologies.  I think the underlying issue here is that the ZooKeeper
configs are not intuitive and are hard to work with.  Making them simpler,
or adding a layer on top that makes them simpler, is necessary in my
opinion.
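
As a rough illustration of that translation step, the Python sketch below
parses the hypothetical DSL from this thread into a plain dictionary.  It is
only a sketch under stated assumptions: the function name compile_flow, the
step patterns, and the output keys (required_fields, enrichments, sinks) are
invented here for illustration and are not Metron's actual ZooKeeper config
schema.  Python is used for brevity even though Metron itself is Java.

```python
import json
import re

# Step shapes copied from the hypothetical DSL examples quoted in this thread.
CALL = re.compile(r'^(\w+)\("([^"]+)"\)$')  # e.g. parser("BasicBroParser")
ASSIGN = re.compile(r'^(\w+)\s*=\s*(\w+)\["([^"]+)"\](?:\.(\w+))?$')  # e.g. owner = assets["ip_src_addr"].owner

EXAMPLE_FLOW = '''
source("bro")
  -> parser("BasicBroParser")
  -> exists("ip_src_addr")
  -> geo_ip_src = geo["ip_src_addr"]
  -> owner = assets["ip_src_addr"].owner
  -> elasticsearch("bro-index")
'''


def compile_flow(dsl: str) -> dict:
    """Translate a DSL flow into an illustrative config dict (NOT Metron's real schema)."""
    config = {"source": None, "parser": None, "required_fields": [],
              "enrichments": [], "sinks": []}
    for raw in dsl.strip().splitlines():
        step = raw.strip()
        if step.startswith("->"):
            step = step[2:].strip()
        if not step:
            continue
        call = CALL.match(step)
        assign = ASSIGN.match(step)
        if call:
            name, arg = call.groups()
            if name == "source":
                config["source"] = arg
            elif name == "parser":
                config["parser"] = arg
            elif name == "exists":
                config["required_fields"].append(arg)
            else:
                # Assumption: any other call, e.g. elasticsearch("bro-index"), is a sink.
                config["sinks"].append({"type": name, "target": arg})
        elif assign:
            output, enrichment, field, attribute = assign.groups()
            config["enrichments"].append({"output": output, "type": enrichment,
                                          "input": field, "attribute": attribute})
        else:
            raise ValueError(f"unrecognized step: {step!r}")
    return config


if __name__ == "__main__":
    print(json.dumps(compile_flow(EXAMPLE_FLOW), indent=2))
```

A real implementation would emit whatever structure the parser and
enrichment topologies already read from ZooKeeper; the point here is only
that each DSL step maps onto one config entry.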

As for the edge cases that have come up and are mentioned in this thread
("parse heterogenous data from a single topic" and "enriched output to land
in unique topics by sensor type"), a simple enhancement could solve both of
those.  Right now the output topic for parser and enrichment topologies is
either passed in when building the topology (Flux or constructor args) or
retrieved from ZooKeeper.  This limits you to one output topic per topology.
Expanding the KafkaWriter class to optionally pull the output topic from a
field in a parsed message, or have it passed in as an input parameter to the
write method, should make it flexible enough to route messages to different
topics.  Also, this statement is not entirely true:  "You cannot use the
output of one enrichment as the input to another".  You can if you use a
Stellar enrichment bolt and HBase enrichments.  Geo and host enrichments
would either need to be exposed through Stellar or, even better, converted
to HBase enrichments.
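
To make the per-message routing idea concrete, here is a minimal Python
sketch.  The RoutingWriter class, its send callback, and the output_topic
field name are all hypothetical and are not the KafkaWriter API.  The
precedence shown (explicit argument, then message field, then the
topology-wide default) is also an assumption, not something settled in this
thread.

```python
from typing import Any, Callable, Dict, Optional

Message = Dict[str, Any]


class RoutingWriter:
    """Illustrative stand-in for a Kafka writer; NOT Metron's KafkaWriter."""

    def __init__(self, send: Callable[[str, Message], None],
                 default_topic: str, topic_field: Optional[str] = None):
        self.send = send                    # hypothetical hook that would wrap a producer
        self.default_topic = default_topic  # today's single per-topology topic
        self.topic_field = topic_field      # optional message field naming the topic

    def resolve_topic(self, message: Message, topic: Optional[str] = None) -> str:
        # Assumed precedence: explicit argument, then message field, then default.
        if topic:
            return topic
        if self.topic_field:
            value = message.get(self.topic_field)
            if value:
                return str(value)
        return self.default_topic

    def write(self, message: Message, topic: Optional[str] = None) -> str:
        resolved = self.resolve_topic(message, topic)
        self.send(resolved, message)
        return resolved


if __name__ == "__main__":
    writer = RoutingWriter(send=lambda t, m: print(t, m),
                           default_topic="enrichments",
                           topic_field="output_topic")
    writer.write({"output_topic": "bro", "ip_src_addr": "10.0.0.1"})
    writer.write({"ip_src_addr": "10.0.0.2"})
```

Messages without the field still land on the default topic, so existing
single-topic topologies would behave as they do now.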

I disagree with the idea that Metron should not be responsible for defining
data flows, and I think that idea conflicts with the goal of abstracting out
the CEP component (Storm, Flink, etc.).  There are patterns that emerge and
tricks the community finds through experience that should be baked in.  An
example of this is the enrichment topologies.  Grouping messages together
by enrichment keys before enrichment allows us to put a caching layer in
front which lightens the load on HBase and makes enrichment more
efficient.  If we put the responsibility of defining topologies on the
user, now they have to be an expert in tuning whatever CEP is chosen as
well as be knowledgeable of established design patterns.  Maybe the current
state of Metron requires Storm tuning expertise anyway, but I think we
should trend away from that and evolve Metron to be more capable of making
intelligent choices automatically.  I remember the early days of Hive
required careful consideration when writing queries to ensure the correct
joins were used, data was distributed evenly, etc.  Tuning Hive is easier
now because it has evolved to be able to make more of these choices
automatically without requiring users to have detailed knowledge of how
things work internally.
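
The grouping-plus-cache pattern mentioned above can be sketched in a few
lines of Python.  This is an assumption-laden illustration, not Metron's
enrichment code: worker_for and CachingEnricher are hypothetical names, the
lookup callback stands in for an HBase read, and a CRC32 hash stands in for
whatever key-based grouping the stream processor provides.

```python
import zlib
from collections import OrderedDict
from typing import Any, Callable


def worker_for(key: str, num_workers: int) -> int:
    """Route a key to a worker; the same key always maps to the same worker."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


class CachingEnricher:
    """Small LRU cache in front of a slow lookup (a stand-in for HBase)."""

    def __init__(self, lookup: Callable[[str], Any], max_entries: int = 1000):
        self.lookup = lookup
        self.max_entries = max_entries
        self.cache: "OrderedDict[str, Any]" = OrderedDict()
        self.backend_calls = 0  # counts how often the slow lookup is hit

    def enrich(self, key: str) -> Any:
        if key in self.cache:
            self.cache.move_to_end(key)     # mark as most recently used
            return self.cache[key]
        self.backend_calls += 1
        value = self.lookup(key)
        self.cache[key] = value
        if len(self.cache) > self.max_entries:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return value


if __name__ == "__main__":
    workers = [CachingEnricher(lookup=lambda ip: {"ip": ip}) for _ in range(4)]
    for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.1", "10.0.0.2"]:
        workers[worker_for(ip, len(workers))].enrich(ip)
    print(sum(w.backend_calls for w in workers))
```

Because a given key always lands on the same worker, repeated keys are
served from that worker's cache.  If messages were distributed at random,
each worker could end up fetching the same key separately.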

Ryan Merriman

On Fri, Oct 7, 2016 at 7:12 AM, Nick Allen <nick@nickallen.org> wrote:

> Whether it is explicit or implicit, I think that would be one of the major
> benefits of having the expressiveness of a DSL.  I can choose to have some
> enrichments run in parallel (the split/join that you are referring to) or
> have some enrichments run serially.
>
> Having enrichments run serially is not something you can easily do with
> Metron today.  You cannot use the output of one enrichment as the input to
> another.
>
> As a simple example, I have a blacklist of countries for which my
> organization should not be doing business.  I need to use the IP to find
> the location and then use the location to match against a blacklist.  I
> need these enrichments to run serially.
>
> source("netflow")
>   -> parser("Netflow")
>   -> exists("ip_src_addr")
>   -> src_country = geo["ip_src_addr"].country
>   -> is_alert = blacklist["src_country"]
>   ...
>
>
>
>
> On Thu, Oct 6, 2016 at 6:25 PM, Matt Foley <mfoley@hortonworks.com> wrote:
>
> > Would splitting and joining be implicit or explicit, for multi-path
> > topologies?
> > ________________________________________
> > From: Zeolla@GMail.com <zeolla@gmail.com>
> > Sent: Thursday, October 06, 2016 11:03 AM
> > To: dev@metron.incubator.apache.org
> > Subject: Re: [DISCUSS] Opinionated Data Flows
> >
> > It should also be smart enough to handle an order like:
> >
> > source("bro")
> >   -> parser("BasicBroParser")
> >   -> exists("ip_src_addr")
> >   -> geo_ip_src = geo["ip_src_addr"]
> >   -> application = assets["ip_src_addr"].application
> >   -> owner = assets["ip_src_addr"].owner
> >   -> exists("ip_dst_addr")
> >   -> geo_ip_dst = geo["ip_dst_addr"]
> >   -> elasticsearch("bro-index")
> >
> > Without duplicate hits of the topologies.
> >
> > Jon
> >
> > On Thu, Oct 6, 2016 at 1:55 PM Nick Allen <nick@nickallen.org> wrote:
> >
> > > > Here is a quick example with some hypothetical syntax.  Whatever that
> > > > syntax might be, it would be very simple, easy to understand, and
> > > > leverage high-level concepts specific to Metron.
> > >
> > > This flow consumes Bro data, ensures there are valid source/destination
> > > IPs, performs geo-enrichment, asset enrichment and finally persists the
> > > data in Elasticsearch.
> > >
> > >
> > > source("bro")
> > >   -> parser("BasicBroParser")
> > >   -> exists("ip_src_addr")
> > >   -> exists("ip_dst_addr")
> > >   -> geo_ip_src = geo["ip_src_addr"]
> > >   -> geo_ip_dst = geo["ip_dst_addr"]
> > >   -> application = assets["ip_src_addr"].application
> > >   -> owner = assets["ip_src_addr"].owner
> > >   -> elasticsearch("bro-index")
> > >
> > >
> > >
> > >
> > > > On Thu, Oct 6, 2016 at 12:58 PM, Nick Allen <nick@nickallen.org> wrote:
> > >
> > > > > Chasing this bad idea down even further leads me to something even
> > > > > crazier.
> > > > >
> > > > > Stellar 1.0 can only operate within a single topology and in most
> > > > > cases only on a single message.  Stellar 2.0 could be the mechanism
> > > > > that allows users to define their own data flows and what "useful
> > > > > bits of Metron functionality" get plugged in.
> > > > >
> > > > > Once you have a DSL that allows users to define what they want
> > > > > Metron to do, then the underlying implementation mechanism (which is
> > > > > currently Storm) can also be swapped out.  If we have an even faster
> > > > > Storm implementation, then we swap in the Storm NG engine.  Maybe we
> > > > > want Metron to also run in Flink, then we just swap in a Flink
> > > > > engine.
> > > >
> > > >
> > > >
> > > >
> > > > > On Thu, Oct 6, 2016 at 12:52 PM, Nick Allen <nick@nickallen.org> wrote:
> > > >
> > > > >> I totally "bird dogged the previous thread" as Casey likes to call
> > > > >> it. :)  I am extracting this thought into a separate thread before
> > > > >> I start throwing out even more, crazier ideas.
> > > > >>
> > > > >>> In general, Metron is very opinionated about data flows right now.
> > > > >>> We have Parser topologies that feed an Enrichment topology, which
> > > > >>> then feeds an Indexing topology.  We have useful bits of
> > > > >>> functionality (think Stellar transforms, Geo enrichment, etc) that
> > > > >>> are closely coupled with these topologies (aka data flows).
> > > > >>>
> > > > >>> When a user wants to parse heterogenous data from a single topic,
> > > > >>> that's not easy.  When a user wants enriched output to land in
> > > > >>> unique topics by sensor type, well, that's also not easy.  When a
> > > > >>> user wanted to skip enrichment of data sources, we actually
> > > > >>> re-architected the data flow to add the Indexing topology.
> > > > >>>
> > > > >>> In an ideal world, a user should be responsible for defining the
> > > > >>> data flow, not Metron.  Metron should provide the "useful bits of
> > > > >>> functionality" that a user can "plugin" wherever they like.  Metron
> > > > >>> itself should not care how the data is moving or what step in the
> > > > >>> process it is at.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Nick Allen <nick@nickallen.org>
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Nick Allen <nick@nickallen.org>
> > > >
> > >
> > >
> > >
> > > --
> > > Nick Allen <nick@nickallen.org>
> > >
> > --
> >
> > Jon
> >
>
>
>
> --
> Nick Allen <nick@nickallen.org>
>
