metron-dev mailing list archives

From Casey Stella <ceste...@gmail.com>
Subject Re: [DISCUSS] Turning off indexing writers feature discussion
Date Fri, 13 Jan 2017 14:10:14 GMT
Yeah, I tend to like the first option too.  Any opposition to that from
anyone?

The points brought up are good ones and I think that it may be worth a
broader discussion of the requirements of indexing in a separate dev list
thread.  Maybe a list of desires with coherent use-cases justifying them so
we can think about how this stuff should work and where the natural
extension points should be.  After all, we need to toe the line between
engineering and overengineering for features nobody will want.

I'm not sure about the extensions to the standard fields.  I'm torn between
the notions that we should have no standard fields vs we should have a
boatload of standard fields (with most of them empty).  I exchange
positions fairly regularly on that question. ;)  It may be worth a dev list
discussion to lay out how you imagine an extension of standard fields and
how it might look as implemented in Metron.

Casey

On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <kylerichardson2@gmail.com>
wrote:

> I'll second my preference for the first option. I think the ability to use
> Stellar filters to customize indexing would be a big win.
>
> I'm glad Matt brought up the point about data lake and CEP. I think this is
> a really important use case that we need to consider. Take a simple
> example... If I have data coming in from 3 different firewall vendors and 2
> different web proxy/url filtering vendors and I want to be able to analyze
> that data set, I need the data to be indexed all together (likely in HDFS)
> and to have a normalized schema such that IP address, URL, and user name
> (to take a few) can be easily queried and aggregated. I can also envision
> scenarios where I would want to index data based on attributes other than
> sensor, business unit or subsidiary for example.
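
As a rough illustration of the normalized schema described above, the
following Python sketch renames vendor-specific fields to common names so
that records from different sensors can be queried together. The sensor
names, the vendor field names, and the "url" and "username" target fields
are hypothetical; "ip_src_addr" is assumed to follow Metron's existing
standard-field naming. This is a sketch of the idea, not how Metron
parsers are implemented.

```python
from typing import Dict

# Hypothetical per-sensor maps from vendor field names to common field names.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "firewall_vendor_a": {"src": "ip_src_addr", "user": "username"},
    "proxy_vendor_b": {"client_ip": "ip_src_addr", "uri": "url", "login": "username"},
}


def normalize(sensor: str, record: Dict[str, object]) -> Dict[str, object]:
    """Rename known vendor fields; pass unknown fields through unchanged."""
    mapping = FIELD_MAPS.get(sensor, {})
    return {mapping.get(key, key): value for key, value in record.items()}


firewall_event = normalize("firewall_vendor_a", {"src": "10.0.0.1", "user": "alice"})
proxy_event = normalize(
    "proxy_vendor_b",
    {"client_ip": "10.0.0.1", "uri": "http://example.com/", "login": "alice"},
)
```

After normalization both records expose the same "ip_src_addr" and
"username" keys, which is what makes a combined query or aggregation
straightforward.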
>
> I've been wanting to propose extending our 7 standard fields to include
> things like URL and user. Is there community interest/support for moving in
> that direction? If so, I'll start a new thread.
>
> Thanks!
>
> -Kyle
>
> On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <mattf@apache.org> wrote:
>
> > Ah, I see.  If overriding the default index name allows using the same
> > name for multiple sensors, then the goal can be achieved.
> > Thanks,
> > --Matt
> >
> >
> > On 1/12/17, 3:30 PM, "Casey Stella" <cestella@gmail.com> wrote:
> >
> >     Oh, you could!  Let's say you have a syslog parser with data from
> >     sources 1, 2 and 3.  You'd end up with one kafka queue with 3 parsers
> >     attached to that queue, each picking apart the messages from source 1,
> >     2 and 3.  They'd go through separate enrichment and into the indexing
> >     topology.  In the indexing topology, you could specify the same index
> >     name "syslog" and all of the messages go into the same index for CEP
> >     querying if so desired.
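
A minimal Python sketch of the shared-index idea in the message above:
each sensor has its own indexing config, the index name defaults to the
sensor name, and several sensors can name the same index. The sensor
names and the in-memory config dictionary are hypothetical stand-ins for
whatever the real per-sensor configs look like.

```python
from typing import Dict

# Hypothetical per-sensor indexing configs; all three name the same index.
INDEXING_CONFIGS: Dict[str, Dict[str, object]] = {
    "syslog_source1": {"index": "syslog"},
    "syslog_source2": {"index": "syslog"},
    "syslog_source3": {"index": "syslog"},
}


def index_name(sensor: str, configs: Dict[str, Dict[str, object]]) -> str:
    """Return the configured index name, defaulting to the sensor name."""
    return str(configs.get(sensor, {}).get("index", sensor))


# Every configured sensor resolves to the single shared index.
targets = {index_name(sensor, INDEXING_CONFIGS) for sensor in INDEXING_CONFIGS}
```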
> >
> >     On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <mattf@apache.org>
> wrote:
> >
> >     > Syslog is hell on parsers – I know, I worked at LogLogic in a previous
> >     > life.  It makes perfect sense to route different lines from syslog
> >     > through different appropriate parsers.  But a lot of what the parsers
> >     > do is identify consistent subsets of metadata and annotate it – eg,
> >     > src_ip_addr, event timestamps, etc.  Once those metadata are annotated
> >     > and available with common field names, why doesn’t it make sense to
> >     > index the messages together, for CEP querying?  I think Splunk has
> >     > illustrated this model.
> >     >
> >     > On 1/12/17, 3:00 PM, "Casey Stella" <cestella@gmail.com> wrote:
> >     >
> >     >     yeah, I mean, honestly, I think the approach that we've taken for
> >     >     sources which aggregate different types of data is to provide
> >     >     filters at the parser level and have multiple parser topologies
> >     >     (with different, possibly mutually exclusive filters) running.
> >     >     This would be a completely separate sensor.  Imagine a syslog data
> >     >     source that aggregates and you want to pick apart certain pieces
> >     >     of messages.  This is why the initial thought and architecture was
> >     >     one index per sensor.
> >     >
> >     >     On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley <mattf@apache.org>
> > wrote:
> >     >
> >     >     > I’m thinking that CEP (Complex Event Processing) is contrary to the
> >     >     > idea of silo-ing data per sensor.
> >     >     > Now it’s true that some of those sensors are already aggregating
> >     >     > data from multiple sources, so maybe I’m wrong here.
> >     >     > But it just seems to me that the “data lake” insights come from
> >     >     > being able to make decisions over the whole mass of data rather
> >     >     > than just vertical slices of it.
> >     >     >
> >     >     > On 1/12/17, 2:15 PM, "Casey Stella" <cestella@gmail.com>
> > wrote:
> >     >     >
> >     >     >     Hey Matt,
> >     >     >
> >     >     >     Thanks for the comment!
> >     >     >     1. At the moment, we only have one index name, the default of
> >     >     >     which is the sensor name but that's entirely up to the user.
> >     >     >     This is sensor specific, so it'd be a separate config for each
> >     >     >     sensor.  If we want to build multiple indices per sensor, we'd
> >     >     >     have to think carefully about how to do that and it would be a
> >     >     >     bigger undertaking.  I guess I can see the use, though (redirect
> >     >     >     messages to one index vs another based on a predicate for a
> >     >     >     given sensor).  Anyway, not where I was originally thinking that
> >     >     >     this discussion would go, but it's an interesting point.
> >     >     >
> >     >     >     2. I hadn't thought through the implementation quite yet, but we
> >     >     >     don't actually have a splitter bolt in that topology, just a
> >     >     >     spout that goes to the elasticsearch writer and also to the hdfs
> >     >     >     writer.
> >     >     >
> >     >     >     On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley <
> > mattf@apache.org>
> >     > wrote:
> >     >     >
> >     >     >     > Casey, good to have controls like this.  Couple questions:
> >     >     >     >
> >     >     >     > 1. Regarding the “index” : “squid” name/value pair, is the index
> >     >     >     > name expected to always be a sensor name?  Or is the given json
> >     >     >     > structure subordinate to a sensor name in zookeeper?  Or can we
> >     >     >     > build arbitrary indexes with this new specification, independent
> >     >     >     > of sensor?  Should there actually be a list of “indexes”, ie
> >     >     >     > { “indexes” : [
> >     >     >     >         {“index” : “name1”,
> >     >     >     >                 …
> >     >     >     >         },
> >     >     >     >         {“index” : “name2”,
> >     >     >     >                 …
> >     >     >     >         } ]
> >     >     >     > }
> >     >     >     >
> >     >     >     > 2. Would the filtering / writer selection logic take place in the
> >     >     >     > indexing topology splitter bolt?  Seems like that would have the
> >     >     >     > smallest impact on current implementation, no?
> >     >     >     >
> >     >     >     > Sorry if these are already answered in PR-415, I haven’t had time
> >     >     >     > to review that one yet.
> >     >     >     > Thanks,
> >     >     >     > --Matt
> >     >     >     >
> >     >     >     >
> >     >     >     > On 1/12/17, 12:55 PM, "Michael Miklavcic" <
> >     >     > michael.miklavcic@gmail.com>
> >     >     >     > wrote:
> >     >     >     >
> >     >     >     >     I like the flexibility and expressibility of the first
> >     >     >     >     option with Stellar filters.
> >     >     >     >
> >     >     >     >     M
> >     >     >     >
> >     >     >     >     On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <
> >     >     > cestella@gmail.com>
> >     >     >     > wrote:
> >     >     >     >
> >     >     >     >     > As of METRON-652
> >     >     >     >     > <https://github.com/apache/incubator-metron/pull/415>, we will
> >     >     >     >     > have decoupled the indexing configuration from the enrichment
> >     >     >     >     > configuration.  As an immediate follow-up to that, I'd like to
> >     >     >     >     > provide the ability to turn off and on writers via the configs.
> >     >     >     >     > I'd like to get some community feedback on how the functionality
> >     >     >     >     > should work, if y'all are amenable. :)
> >     >     >     >     >
> >     >     >     >     >
> >     >     >     >     > As of now, we have 3 possible writers which can be used in the
> >     >     >     >     > indexing topology:
> >     >     >     >     >
> >     >     >     >     >    - Solr
> >     >     >     >     >    - Elasticsearch
> >     >     >     >     >    - HDFS
> >     >     >     >     >
> >     >     >     >     > HDFS is always used, elasticsearch or solr is used depending on
> >     >     >     >     > how you start the indexing topology.
> >     >     >     >     >
> >     >     >     >     > A couple of proposals come to mind immediately:
> >     >     >     >     >
> >     >     >     >     > *Index Filtering*
> >     >     >     >     >
> >     >     >     >     > You would be able to specify a filter as defined by a stellar
> >     >     >     >     > statement (likely a reuse of the StellarFilter that exists in
> >     >     >     >     > the Parsers) which would allow you to indicate on a
> >     >     >     >     > message-by-message basis whether or not to write the message.
> >     >     >     >     >
> >     >     >     >     > The semantics of this would be as follows:
> >     >     >     >     >
> >     >     >     >     >    - Default (i.e. unspecified) is to pass everything through
> >     >     >     >     >    (hence backwards compatible with the current default config).
> >     >     >     >     >    - Messages which have the associated stellar statement
> >     >     >     >     >    evaluate to true for the writer type will be written,
> >     >     >     >     >    otherwise not.
> >     >     >     >     >
> >     >     >     >     >
> >     >     >     >     > Sample indexing config which would write out no messages to
> >     >     >     >     > HDFS and write to Elasticsearch only messages containing a
> >     >     >     >     > field called "field1":
> >     >     >     >     > {
> >     >     >     >     >    "index" : "squid"
> >     >     >     >     >   ,"batchSize" : 100
> >     >     >     >     >   ,"filters" : {
> >     >     >     >     >       "HDFS" : "false"
> >     >     >     >     >      ,"ES" : "exists(field1)"
> >     >     >     >     >                  }
> >     >     >     >     > }
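
The filter semantics proposed above can be sketched in Python roughly as
follows. This is only an illustration under stated assumptions: the
predicates are plain Python callables standing in for the Stellar
statements "false" and "exists(field1)" (Stellar itself is not evaluated
here), and the function and variable names are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, object]
Predicate = Callable[[Message], bool]

# Hypothetical Python stand-ins for the Stellar statements in the sample
# config: "HDFS" : "false" and "ES" : "exists(field1)".
SAMPLE_FILTERS: Dict[str, Predicate] = {
    "HDFS": lambda message: False,
    "ES": lambda message: "field1" in message,
}


def writers_for(
    message: Message,
    writers: List[str],
    filters: Optional[Dict[str, Predicate]] = None,
) -> List[str]:
    """Return the writers that should receive this message.

    A writer with no filter passes everything through (the backwards
    compatible default); otherwise its predicate must evaluate to true.
    """
    filters = filters or {}
    selected = []
    for writer in writers:
        predicate = filters.get(writer)
        if predicate is None or predicate(message):
            selected.append(writer)
    return selected


with_field = writers_for({"field1": "x"}, ["HDFS", "ES"], SAMPLE_FILTERS)
without_field = writers_for({"other": "x"}, ["HDFS", "ES"], SAMPLE_FILTERS)
unfiltered = writers_for({"other": "x"}, ["HDFS", "ES"])
```

With the sample filters, a message carrying "field1" goes only to ES, a
message without it goes nowhere, and omitting the filters sends every
message to every writer.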
> >     >     >     >     >
> >     >     >     >     > *Index On/Off Switch*
> >     >     >     >     >
> >     >     >     >     > A simpler solution would be to just provide a list of writers
> >     >     >     >     > to write messages.  The semantics would be as follows:
> >     >     >     >     >
> >     >     >     >     >    - If the list is unspecified, then the default is to write
> >     >     >     >     >    all messages for every writer in the indexing topology
> >     >     >     >     >    - If the list is specified, then a writer will write all
> >     >     >     >     >    messages if and only if it is named in the list.
> >     >     >     >     >
> >     >     >     >     > Sample indexing config which turns off HDFS and keeps on
> >     >     >     >     > Elasticsearch:
> >     >     >     >     > {
> >     >     >     >     >    "index" : "squid"
> >     >     >     >     >   ,"batchSize" : 100
> >     >     >     >     >   ,"writers" : [ "ES" ]
> >     >     >     >     > }
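
For comparison, the on/off switch semantics above could be sketched like
this in Python. The function name and the list of writers running in the
topology are hypothetical; only the "writers" key comes from the sample
config.

```python
from typing import Dict, List


def enabled_writers(config: Dict[str, object], topology_writers: List[str]) -> List[str]:
    """Return the writers that should write all messages for this sensor.

    If "writers" is absent, every writer in the topology is enabled;
    if present, a writer is enabled if and only if it is named in the list.
    """
    listed = config.get("writers")
    if listed is None:
        return list(topology_writers)
    return [writer for writer in topology_writers if writer in listed]


sample_config = {"index": "squid", "batchSize": 100, "writers": ["ES"]}
only_es = enabled_writers(sample_config, ["HDFS", "ES"])
default_all = enabled_writers({"index": "squid", "batchSize": 100}, ["HDFS", "ES"])
```

Note that under these semantics an explicitly empty list turns every
writer off, which differs from leaving the list unspecified.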
> >     >     >     >     >
> >     >     >     >     > Thanks in advance for the feedback!  Also, if you have any
> >     >     >     >     > other, better ideas than the ones presented here, let me know
> >     >     >     >     > too.
> >     >     >     >     >
> >     >     >     >     > Best,
> >     >     >     >     >
> >     >     >     >     > Casey
> >     >     >     >     >
> >     >     >     >
> >     >     >     >
> >     >     >     >
> >     >     >     >
> >     >     >     >
> >     >     >
> >     >     >
> >     >     >
> >     >     >
> >     >     >
> >     >
> >     >
> >     >
> >     >
> >     >
> >
> >
> >
> >
>
