metron-dev mailing list archives

From Nick Allen <n...@nickallen.org>
Subject Re: [DISCUSS] Opinionated Data Flows
Date Tue, 11 Oct 2016 17:46:54 GMT
>
>
> I disagree with the idea that Metron should not be responsible for defining
> data flows and I think that conflicts with the idea of abstracting out the
> CEP component (Storm, Flink, etc).


When I say that a user should be able to define the data flow, I don't mean
that in terms of the underlying implementation; aka topologies.  I mean
that from a user's perspective.  A user should be able to define the
sequence of validations, transformations, and enrichments that occur (or do
not occur).

Maybe I over-generalized in my rant around the data flow.  There are two
concerns that led me to this idea of allowing a user to define the data
flow.


(1) The first is from the user's perspective.  Users need to have enough
power and expressiveness to easily capture, transform, enrich and act on
the data that exists in their environment.

Another good concrete example of this popped up today.  Casey just opened
METRON-496, which I believe also highlights the problem.

*METRON-496: Field transformations are applied after validation, which
means that the validation cannot be affected by the transformations.
Consider a situation where you get a timestamp field in as a string and the
parser validation expects a long.  Conversion could be done as part of a
field transformation, whereas now it would fail validation.*


Based on our current topology design, we have effectively "hard coded" that
validations occur prior to transformations.  This limits what a user can
do.  How can we avoid imposing this on the user?  Isn't there some way
that we can allow the user to define the sequence of transformations,
validations, and enrichments?
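
To make the ordering problem concrete, here is a minimal sketch in Python
(chosen only for brevity).  Every name in it (to_long, require_type,
run_flow) is hypothetical; none of this is existing Metron code.  It only
shows that the same two steps accept or reject the METRON-496 message
depending on the order the user lists them in.

```python
# Hypothetical sketch: these helpers are illustrative, not Metron APIs.

def to_long(field):
    """Transformation step: convert a string field to an integer."""
    def step(message):
        converted = dict(message)  # copy so the input message is not mutated
        converted[field] = int(converted[field])
        return converted
    return step


def require_type(field, expected_type):
    """Validation step: reject the message (return None) on a type mismatch."""
    def step(message):
        return message if isinstance(message.get(field), expected_type) else None
    return step


def run_flow(steps, message):
    """Apply the steps in exactly the order the user listed them."""
    for step in steps:
        message = step(message)
        if message is None:  # a validation rejected the message
            return None
    return message


raw = {"timestamp": "1476208014"}

# The order that is hard coded today: validate, then transform.
validate_first = [require_type("timestamp", int), to_long("timestamp")]

# A user-defined order: transform, then validate.
transform_first = [to_long("timestamp"), require_type("timestamp", int)]

print(run_flow(validate_first, raw))   # None
print(run_flow(transform_first, raw))  # {'timestamp': 1476208014}
```

The point of the sketch is that the steps themselves do not change; only
the sequence does, and the sequence is what I would like the user to own.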


(2) My second concern is more from the developer's perspective.  Most of
the functionality we have is in some way dependent on the topology that it
is used in.  We have useful bits of functionality (think Stellar
transforms, Geo enrichment, etc) that are closely coupled with our
topologies.

A good example of this is that I could not reuse the existing "writer"
code base when implementing the Profiler.  The "writer" code base has lots
of references to the topology and sensor type; concepts that do not exist
in the Profiler.  This should all be factored out.  A writer should not
care in which topology or for what sensor type it is being used.

Properly containing these concepts makes the code more reusable. An example
of how this could look is the HBaseBolt and HBaseMapper in 'metron-hbase'.
This allows any topology to write data to HBase.  There is nothing in that
code that ties it to a specific topology or sensor type.
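
As a rough illustration of that separation, here is a small Python sketch.
It is not the 'metron-hbase' code and the names (InMemoryWriter,
profile_mapper, sensor_mapper) are made up; it only shows a writer that
takes a caller-supplied mapper and never mentions a topology or sensor
type itself.

```python
# Hypothetical sketch: an in-memory stand-in, not the metron-hbase classes.
from typing import Callable, Dict, Tuple

Message = Dict[str, object]
Mapper = Callable[[Message], Tuple[str, Message]]


class InMemoryWriter:
    """Stores rows by key; knows nothing about topologies or sensor types."""

    def __init__(self, mapper: Mapper) -> None:
        self.mapper = mapper
        self.rows: Dict[str, Message] = {}

    def write(self, message: Message) -> None:
        # The mapper decides the row key and the stored value.
        key, value = self.mapper(message)
        self.rows[key] = value


def profile_mapper(message: Message) -> Tuple[str, Message]:
    """Mapping a Profiler-like caller might supply (no sensor type involved)."""
    return f"{message['profile']}:{message['entity']}", {"value": message["value"]}


def sensor_mapper(message: Message) -> Tuple[str, Message]:
    """Mapping a sensor-oriented caller might supply."""
    return f"{message['sensor']}:{message['guid']}", dict(message)


profile_writer = InMemoryWriter(profile_mapper)
profile_writer.write({"profile": "logins", "entity": "10.0.0.1", "value": 42})

sensor_writer = InMemoryWriter(sensor_mapper)
sensor_writer.write({"sensor": "bro", "guid": "abc", "ip_src_addr": "10.0.0.1"})

print(profile_writer.rows)
print(sensor_writer.rows)
```

Anything caller-specific lives in the mapper, so the same writer class
serves both callers unchanged.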





On Mon, Oct 10, 2016 at 12:49 PM, Ryan Merriman <merrimanr@gmail.com> wrote:

> I think this is a great discussion.  I especially like the DSL examples
> that are given and think we should expand on that.  The good news is that
> we are not far away from being able to actually implement it.  It's just a
> matter of transforming that syntax into the zookeeper configs that drive
> the topologies.  I think the underlying issue here is that the zookeeper
> configs are not intuitive and are hard to work with.  Making them simpler
> or adding a layer on top that makes them simpler is necessary in my
> opinion.
>
> As for the edge cases that have come up and are mentioned in this thread
> ("parse heterogeneous data from a single topic" and "enriched output to land
> in unique topics by sensor type"), a simple enhancement could solve both of
> those.  Right now the output topic for parser and enrichment topologies is
> either passed in when building the topology (flux or constructor args) or
> retrieved from zookeeper.  This limits you to 1 output topic per topology.
> Expanding the KafkaWriter class to optionally pull the output topic from a
> field in a parsed message or have it passed in as an input parameter to the
> write method should make it flexible enough to route messages to different
> topics.  Also this statement is not entirely true:  "You cannot use the
> output of one enrichment as the input to another".  You can if you use a
> Stellar enrichment bolt and HBase enrichments.  Geo and host enrichments
> would either need to be exposed through Stellar, or even better, converted
> to HBase enrichments.
>
> I disagree with the idea that Metron should not be responsible for defining
> data flows and I think that conflicts with the idea of abstracting out the
> CEP component (Storm, Flink, etc).  There are patterns that emerge and
> tricks the community finds through experience that should be baked in.  An
> example of this is the enrichment topologies.  Grouping messages together
> by enrichment keys before enrichment allows us to put a caching layer in
> front which lightens the load on HBase and makes enrichment more
> efficient.  If we put the responsibility of defining topologies on the
> user, now they have to be an expert in tuning whatever CEP is chosen as
> well as be knowledgeable of established design patterns.  Maybe the current
> state of Metron requires Storm tuning expertise anyways but I think we
> should trend away from that and evolve Metron to be more capable of making
> intelligent choices automatically.  I remember the early days of Hive
> required careful consideration when writing queries to ensure the correct
> joins were used, data was distributed evenly, etc.  Tuning Hive is easier
> now because it has evolved to be able to make more of these choices
> automatically without requiring users to have detailed knowledge of how
> things work internally.
>
> Ryan Merriman
>
> On Fri, Oct 7, 2016 at 7:12 AM, Nick Allen <nick@nickallen.org> wrote:
>
> > Whether it is explicit or implicit, I think that would be one of the major
> > benefits of having the expressiveness of a DSL.  I can choose to have some
> > enrichments run in parallel (the split/join that you are referring to) or
> > have some enrichments run serially.
> >
> > Having enrichments run serially is not something you can easily do with
> > Metron today.  You cannot use the output of one enrichment as the input to
> > another.
> >
> > As a simple example, I have a blacklist of countries for which my
> > organization should not be doing business.  I need to use the IP to find
> > the location and then use the location to match against a blacklist.  I
> > need these enrichments to run serially.
> >
> > source("netflow")
> >   -> parser("Netflow")
> >   -> exists("ip_src_addr")
> >   -> src_country = geo["ip_src_addr"].country
> >   -> is_alert = blacklist["src_country"]
> >   ...
> >
> >
> >
> >
> > On Thu, Oct 6, 2016 at 6:25 PM, Matt Foley <mfoley@hortonworks.com> wrote:
> >
> > > Would splitting and joining be implicit or explicit, for multi-path
> > > topologies?
> > > ________________________________________
> > > From: Zeolla@GMail.com <zeolla@gmail.com>
> > > Sent: Thursday, October 06, 2016 11:03 AM
> > > To: dev@metron.incubator.apache.org
> > > Subject: Re: [DISCUSS] Opinionated Data Flows
> > >
> > > It should also be smart enough to handle an order like:
> > >
> > > source("bro")
> > >   -> parser("BasicBroParser")
> > >   -> exists("ip_src_addr")
> > >   -> geo_ip_src = geo["ip_src_addr"]
> > >   -> application = assets["ip_src_addr"].application
> > >   -> owner = assets["ip_src_addr"].owner
> > >   -> exists("ip_dst_addr")
> > >   -> geo_ip_dst = geo["ip_dst_addr"]
> > >   -> elasticsearch("bro-index")
> > >
> > > Without duplicate hits of the topologies.
> > >
> > > Jon
> > >
> > > On Thu, Oct 6, 2016 at 1:55 PM Nick Allen <nick@nickallen.org> wrote:
> > >
> > > > Here is a quick example with some hypothetical syntax.  Whatever that
> > > > syntax might be, it would be very simple, easy to understand, and
> > > > leverage high-level concepts specific to Metron.
> > > >
> > > > This flow consumes Bro data, ensures there are valid source/destination
> > > > IPs, performs geo-enrichment, asset enrichment and finally persists the
> > > > data in Elasticsearch.
> > > >
> > > >
> > > > source("bro")
> > > >   -> parser("BasicBroParser")
> > > >   -> exists("ip_src_addr")
> > > >   -> exists("ip_dst_addr")
> > > >   -> geo_ip_src = geo["ip_src_addr"]
> > > >   -> geo_ip_dst = geo["ip_dst_addr"]
> > > >   -> application = assets["ip_src_addr"].application
> > > >   -> owner = assets["ip_src_addr"].owner
> > > >   -> elasticsearch("bro-index")
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Oct 6, 2016 at 12:58 PM, Nick Allen <nick@nickallen.org> wrote:
> > > >
> > > > > Chasing this bad idea down even further leads me to something even
> > > > > crazier.
> > > > >
> > > > > Stellar 1.0 can only operate within a single topology and in most cases
> > > > > only on a single message.  Stellar 2.0 could be the mechanism that
> > > > > allows users to define their own data flows and what "useful bits of
> > > > > Metron functionality" get plugged in.
> > > > >
> > > > > Once you have a DSL that allows users to define what they want Metron
> > > > > to do, then the underlying implementation mechanism (which is currently
> > > > > Storm) can also be swapped out.  If we have an even faster Storm
> > > > > implementation, then we swap in the Storm NG engine.  Maybe we want
> > > > > Metron to also run in Flink, then we just swap in a Flink engine.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Oct 6, 2016 at 12:52 PM, Nick Allen <nick@nickallen.org> wrote:
> > > > >
> > > > >> I totally "bird dogged the previous thread" as Casey likes to call it. :)
> > > > >> I am extracting this thought into a separate thread before I start
> > > > >> throwing out even more, crazier ideas.
> > > > >>
> > > > >>> In general, Metron is very opinionated about data flows right now.  We
> > > > >>> have Parser topologies that feed an Enrichment topology, which then
> > > > >>> feeds an Indexing topology.  We have useful bits of functionality (think
> > > > >>> Stellar transforms, Geo enrichment, etc) that are closely coupled with
> > > > >>> these topologies (aka data flows).
> > > > >>>
> > > > >>
> > > > >>
> > > > >>> When a user wants to parse heterogenous data from a single topic, that's
> > > > >>> not easy.  When a user wants enriched output to land in unique topics by
> > > > >>> sensor type, well, that's also not easy.  When a user wanted to skip
> > > > >>> enrichment of data sources, we actually re-architected the data flow to
> > > > >>> add the Indexing topology.
> > > > >>>
> > > > >>
> > > > >>
> > > > >>> In an ideal world, a user should be responsible for defining the data
> > > > >>> flow, not Metron.  Metron should provide the "useful bits of
> > > > >>> functionality" that a user can "plugin" wherever they like.  Metron
> > > > >>> itself should not care how the data is moving or what step in the
> > > > >>> process it is at.
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Nick Allen <nick@nickallen.org>
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Nick Allen <nick@nickallen.org>
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Nick Allen <nick@nickallen.org>
> > > >
> > > --
> > >
> > > Jon
> > >
> >
> >
> >
> > --
> > Nick Allen <nick@nickallen.org>
> >
>



-- 
Nick Allen <nick@nickallen.org>
