metron-dev mailing list archives

From Matt Foley <ma...@apache.org>
Subject Re: [DISCUSS] Turning off indexing writers feature discussion
Date Thu, 12 Jan 2017 23:51:16 GMT
Ah, I see.  If overriding the default index name allows using the same name for multiple sensors,
then the goal can be achieved.
Thanks,
--Matt
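
P.S. A minimal sketch of how I understand the override, assuming each sensor has its own indexing config and the "index" value falls back to the sensor name when unset. The sensor names (syslog_src1 and so on) and both helper functions are hypothetical and only illustrate the lookup; they are not Metron code.

```python
from collections import defaultdict

# Hypothetical per-sensor indexing configs; the sensor names are invented.
SENSOR_CONFIGS = {
    "syslog_src1": {"index": "syslog", "batchSize": 100},
    "syslog_src2": {"index": "syslog", "batchSize": 100},
    "syslog_src3": {"index": "syslog", "batchSize": 100},
    "squid": {"batchSize": 100},  # no override: falls back to the sensor name
}


def resolve_index(sensor, configs):
    """Return the configured index name, defaulting to the sensor name."""
    return configs.get(sensor, {}).get("index", sensor)


def group_sensors_by_index(configs):
    """Map each index name to the sorted list of sensors that write to it."""
    groups = defaultdict(list)
    for sensor in configs:
        groups[resolve_index(sensor, configs)].append(sensor)
    return {index: sorted(sensors) for index, sensors in groups.items()}
```

With these configs, the three syslog sensors all resolve to the "syslog" index, while squid keeps its own.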


On 1/12/17, 3:30 PM, "Casey Stella" <cestella@gmail.com> wrote:

    Oh, you could!  Let's say you have a syslog parser with data from sources 1,
    2 and 3.  You'd end up with one kafka queue with 3 parsers attached to that
    queue, each picking apart the messages from sources 1, 2 and 3.  They'd go
    through separate enrichment and into the indexing topology.  In the
    indexing topology, you could specify the same index name "syslog" and all
    of the messages go into the same index for CEP querying if so desired.
    
    On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <mattf@apache.org> wrote:
    
    > Syslog is hell on parsers – I know, I worked at LogLogic in a previous
    > life.  It makes perfect sense to route different lines from syslog through
    > different appropriate parsers.  But a lot of what the parsers do is
    > identify consistent subsets of metadata and annotate it – eg, src_ip_addr,
    > event timestamps, etc.  Once those metadata are annotated and available
    > with common field names, why doesn’t it make sense to index the messages
    > together, for CEP querying?  I think Splunk has illustrated this model.
    >
    > On 1/12/17, 3:00 PM, "Casey Stella" <cestella@gmail.com> wrote:
    >
    >     yeah, I mean, honestly, I think the approach that we've taken for
    >     sources which aggregate different types of data is to provide filters
    >     at the parser level and have multiple parser topologies (with
    >     different, possibly mutually exclusive filters) running.  This would
    >     be a completely separate sensor.  Imagine a syslog data source that
    >     aggregates and you want to pick apart certain pieces of messages.
    >     This is why the initial thought and architecture was one index per
    >     sensor.
    >
    >     On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley <mattf@apache.org> wrote:
    >
    >     > I’m thinking that CEP (Complex Event Processing) is contrary to the
    >     > idea of silo-ing data per sensor.
    >     > Now it’s true that some of those sensors are already aggregating
    >     > data from multiple sources, so maybe I’m wrong here.
    >     > But it just seems to me that the “data lake” insights come from
    >     > being able to make decisions over the whole mass of data rather than
    >     > just vertical slices of it.
    >     >
    >     > On 1/12/17, 2:15 PM, "Casey Stella" <cestella@gmail.com> wrote:
    >     >
    >     >     Hey Matt,
    >     >
    >     >     Thanks for the comment!
    >     >     1. At the moment, we only have one index name, the default of
    >     >     which is the sensor name, but that's entirely up to the user.
    >     >     This is sensor specific, so it'd be a separate config for each
    >     >     sensor.  If we want to build multiple indices per sensor, we'd
    >     >     have to think carefully about how to do that, and it would be a
    >     >     bigger undertaking.  I guess I can see the use, though (redirect
    >     >     messages to one index vs another based on a predicate for a given
    >     >     sensor).  Anyway, not where I was originally thinking that this
    >     >     discussion would go, but it's an interesting point.
    >     >
    >     >     2. I hadn't thought through the implementation quite yet, but we
    >     >     don't actually have a splitter bolt in that topology, just a
    >     >     spout that goes to the elasticsearch writer and also to the hdfs
    >     >     writer.
    >     >
    >     >     On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley <mattf@apache.org> wrote:
    >     >
    >     >     > Casey, good to have controls like this.  Couple questions:
    >     >     >
    >     >     > 1. Regarding the “index” : “squid” name/value pair, is the
    >     >     > index name expected to always be a sensor name?  Or is the given
    >     >     > json structure subordinate to a sensor name in zookeeper?  Or can
    >     >     > we build arbitrary indexes with this new specification,
    >     >     > independent of sensor?  Should there actually be a list of
    >     >     > “indexes”, ie
    >     >     > { “indexes” : [
    >     >     >         {“index” : “name1”,
    >     >     >                 …
    >     >     >         },
    >     >     >         {“index” : “name2”,
    >     >     >                 …
    >     >     >         } ]
    >     >     > }
    >     >     >
    >     >     > 2. Would the filtering / writer selection logic take place in
    >     >     > the indexing topology splitter bolt?  Seems like that would have
    >     >     > the smallest impact on current implementation, no?
    >     >     >
    >     >     > Sorry if these are already answered in PR-415, I haven’t had
    >     >     > time to review that one yet.
    >     >     > Thanks,
    >     >     > --Matt
    >     >     >
    >     >     >
    >     >     > On 1/12/17, 12:55 PM, "Michael Miklavcic" <michael.miklavcic@gmail.com> wrote:
    >     >     >
    >     >     >     I like the flexibility and expressibility of the first
    >     >     >     option with Stellar filters.
    >     >     >
    >     >     >     M
    >     >     >
    >     >     >     On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <cestella@gmail.com> wrote:
    >     >     >
    >     >     >     > As of METRON-652
    >     >     >     > <https://github.com/apache/incubator-metron/pull/415>, we
    >     >     >     > will have decoupled the indexing configuration from the
    >     >     >     > enrichment configuration.  As an immediate follow-up to
    >     >     >     > that, I'd like to provide the ability to turn off and on
    >     >     >     > writers via the configs.  I'd like to get some community
    >     >     >     > feedback on how the functionality should work, if y'all are
    >     >     >     > amenable. :)
    >     >     >     >
    >     >     >     >
    >     >     >     > As of now, we have 3 possible writers which can be used in
    >     >     >     > the indexing topology:
    >     >     >     >
    >     >     >     >    - Solr
    >     >     >     >    - Elasticsearch
    >     >     >     >    - HDFS
    >     >     >     >
    >     >     >     > HDFS is always used; Elasticsearch or Solr is used
    >     >     >     > depending on how you start the indexing topology.
    >     >     >     >
    >     >     >     > A couple of proposals come to mind immediately:
    >     >     >     >
    >     >     >     > *Index Filtering*
    >     >     >     >
    >     >     >     > You would be able to specify a filter as defined by a
    >     >     >     > stellar statement (likely a reuse of the StellarFilter that
    >     >     >     > exists in the Parsers) which would allow you to indicate on
    >     >     >     > a message-by-message basis whether or not to write the
    >     >     >     > message.
    >     >     >     >
    >     >     >     > The semantics of this would be as follows:
    >     >     >     >
    >     >     >     >    - Default (i.e. unspecified) is to pass everything
    >     >     >     >    through (hence backwards compatible with the current
    >     >     >     >    default config).
    >     >     >     >    - Messages which have the associated stellar statement
    >     >     >     >    evaluate to true for the writer type will be written,
    >     >     >     >    otherwise not.
    >     >     >     >
    >     >     >     >
    >     >     >     > Sample indexing config which would write out no messages to
    >     >     >     > HDFS and write out to Elasticsearch only messages containing
    >     >     >     > a field called "field1":
    >     >     >     > {
    >     >     >     >    "index" : "squid"
    >     >     >     >   ,"batchSize" : 100
    >     >     >     >   ,"filters" : {
    >     >     >     >       "HDFS" : "false"
    >     >     >     >      ,"ES" : "exists(field1)"
    >     >     >     >                  }
    >     >     >     > }
    >     >     >     >
    >     >     >     > *Index On/Off Switch*
    >     >     >     >
    >     >     >     > A simpler solution would be to just provide a list of
    >     >     >     > writers to write messages.  The semantics would be as
    >     >     >     > follows:
    >     >     >     >
    >     >     >     >    - If the list is unspecified, then the default is to
    >     >     >     >    write all messages for every writer in the indexing
    >     >     >     >    topology.
    >     >     >     >    - If the list is specified, then a writer will write all
    >     >     >     >    messages if and only if it is named in the list.
    >     >     >     >
    >     >     >     > Sample indexing config which turns off HDFS and keeps
    >     >     >     > Elasticsearch on:
    >     >     >     > {
    >     >     >     >    "index" : "squid"
    >     >     >     >   ,"batchSize" : 100
    >     >     >     >   ,"writers" : [ "ES" ]
    >     >     >     > }
    >     >     >     >
    >     >     >     > Thanks in advance for the feedback!  Also, if you have any
    >     >     >     > other, better ideas than the ones presented here, let me
    >     >     >     > know too.
    >     >     >     >
    >     >     >     > Best,
    >     >     >     >
    >     >     >     > Casey
    >     >     >     >
    >     >     >
    >     >     >
    >     >     >
    >     >     >
    >     >     >
    >     >
    >     >
    >     >
    >     >
    >     >
    >
    >
    >
    >
    >
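
To compare the two proposals from the original message side by side, here is a rough Python sketch of their semantics. It assumes the indexing topology runs the HDFS and ES writers, as in the sample configs. Stellar is not evaluated here: the PREDICATES table is a hypothetical stand-in that maps the two filter strings from the sample config to Python functions, and both function names are made up for illustration.

```python
# Writers assumed to be running in the indexing topology for this sketch.
ACTIVE_WRITERS = ["HDFS", "ES"]

# Stand-in for Stellar evaluation (an assumption, not the real engine):
# each filter string from the sample config maps to a Python predicate.
PREDICATES = {
    "false": lambda message: False,
    "exists(field1)": lambda message: "field1" in message,
}


def writers_by_filter(message, config, active_writers, predicates):
    """Index Filtering: a writer receives the message when its filter is
    unspecified (pass-through default) or evaluates to true."""
    filters = config.get("filters", {})
    selected = []
    for writer in active_writers:
        expression = filters.get(writer)
        if expression is None or predicates[expression](message):
            selected.append(writer)
    return selected


def writers_by_list(config, active_writers):
    """Index On/Off Switch: with no list, every writer writes everything;
    with a list, a writer writes if and only if it is named."""
    named = config.get("writers")
    if named is None:
        return list(active_writers)
    return [writer for writer in active_writers if writer in named]


# The two sample configs from the proposal.
FILTER_CONFIG = {
    "index": "squid",
    "batchSize": 100,
    "filters": {"HDFS": "false", "ES": "exists(field1)"},
}
SWITCH_CONFIG = {"index": "squid", "batchSize": 100, "writers": ["ES"]}
```

The filter version decides per message and per writer, while the switch version decides once per writer, which is the trade-off between flexibility and simplicity discussed in the thread.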
    


