metron-dev mailing list archives

From Nick Allen <n...@nickallen.org>
Subject Re: [DISCUSS] Turning off indexing writers feature discussion
Date Fri, 13 Jan 2017 15:39:14 GMT
Are you saying we support all of these variants?  I realize you are trying
to have some backwards compatibility, but this also makes it harder for a
user to grok (for me at least).

Personally I prefer my original example because it has fewer sub-structures,
like 'writerConfig', which makes the whole thing simpler.  But maybe others
will find your proposal just as easy to grok.
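
To make the comparison concrete, here is a minimal sketch of how the nested
format quoted below might resolve to per-writer settings.  It is only an
illustration: resolve_writer_config is a hypothetical helper, not Metron code,
and the defaults (index named after the sensor, batchSize of 1, a "when" of
"true") are assumptions read off the cases in the quoted proposal.

```python
import json

# Hypothetical helper: names and defaults are assumptions taken from the
# proposal quoted below, not from Metron's actual implementation.
def resolve_writer_config(sensor, config, writer):
    """Merge top-level defaults with any writer-specific overrides."""
    resolved = {
        "index": config.get("index", sensor),     # default: index named after the sensor
        "batchSize": config.get("batchSize", 1),  # default: batch size of 1
        "when": "true",                           # assumed default: write every message
    }
    # Writer-specific settings, if present, override the shared defaults.
    resolved.update(config.get("writerConfig", {}).get(writer, {}))
    return resolved

# The "writer-specific case with filters" from the quoted proposal.
config = json.loads("""
{
  "index": "foo",
  "batchSize": 1,
  "writerConfig": {
    "elasticsearch": {"batchSize": 100, "when": "exists(field1)"},
    "hdfs": {"when": "false"}
  }
}
""")

for writer in ("elasticsearch", "hdfs"):
    print(writer, resolve_writer_config("squid", config, writer))
```

Every setting ends up depending on two places in the file (the top level and
the writer's own block), which is the extra indirection I find harder to
follow than one flat block per writer.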



On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella <cestella@gmail.com> wrote:

> Ok, so here's what I'm thinking based on the discussion:
>
>    - Keeping the configs that we have now (batchSize and index) as defaults
>    for the unspecified writer-specific case
>    - Adding the config Nick suggested
>
> *Base Case*:
> {
> }
>
>    - all writers write all messages
>    - index named the same as the sensor for all writers
>    - batchSize of 1 for all writers
>
> *Writer-non-specific case*:
> {
>   "index" : "foo"
>  ,"batchSize" : 100
> }
>
>    - All writers write all messages
>    - index is named "foo", different from the sensor for all writers
>    - batchSize is 100 for all writers
>
> *Writer-specific case without filters*
> {
>   "index" : "foo"
>  ,"batchSize" : 1
>  , "writerConfig" :
>    {
>       "elasticsearch" : {
>                                    "batchSize" : 100
>                                  }
>    }
> }
>
>    - All writers write all messages
>    - index is named "foo", different from the sensor for all writers
>    - batchSize is 1 for HDFS and 100 for elasticsearch writers
>    - NOTE: I could override the index name too
>
> *Writer-specific case with filters*
> {
>   "index" : "foo"
>  ,"batchSize" : 1
>  , "writerConfig" :
>    {
>       "elasticsearch" : {
>                                    "batchSize" : 100,
>                                    "when" : "exists(field1)"
>                                  },
>       "hdfs" : {
>                      "when" : "false"
>                   }
>    }
> }
>
>    - ES writer writes messages which have field1, HDFS doesn't
>    - index is named "foo", different from the sensor for all writers
>    - batchSize is 100 for elasticsearch writers
>
> Thoughts?
>
> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby <cduby@hortonworks.com>
> wrote:
>
> > For larger installations you need to control what is indexed so you don’t
> > end up with a nasty elastic search situation and so you can mine the data
> > later for reports and training ml models.
> >
> > Thanks
> > Carolyn
> >
> >
> >
> >
> > On 1/13/17, 9:40 AM, "Casey Stella" <cestella@gmail.com> wrote:
> >
> > >OH that's a good idea!
> > >
> > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen <nick@nickallen.org> wrote:
> > >
> > >> I like the "Index Filtering" option based on the flexibility that it
> > >> provides.  Should each output (HDFS, ES, etc) have its own configuration
> > >> settings?  For example, aren't things like batching handled separately for
> > >> HDFS versus Elasticsearch?
> > >>
> > >> Something along the lines of...
> > >>
> > >> {
> > >>   "hdfs" : {
> > >>     "when": "exists(field1)",
> > >>     "batchSize": 100
> > >>   },
> > >>
> > >>   "elasticsearch" : {
> > >>     "when": "true",
> > >>     "batchSize": 1000,
> > >>     "index": "squid"
> > >>   }
> > >> }
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella <cestella@gmail.com> wrote:
> > >>
> > >> > Yeah, I tend to like the first option too.  Any opposition to that from
> > >> > anyone?
> > >> >
> > >> > The points brought up are good ones and I think that it may be worth a
> > >> > broader discussion of the requirements of indexing in a separate dev list
> > >> > thread.  Maybe a list of desires with coherent use-cases justifying them so
> > >> > we can think about how this stuff should work and where the natural
> > >> > extension points should be.  After all, we need to walk the line between
> > >> > engineering and overengineering for features nobody will want.
> > >> >
> > >> > I'm not sure about the extensions to the standard fields.  I'm torn between
> > >> > the notions that we should have no standard fields vs we should have a
> > >> > boatload of standard fields (with most of them empty).  I exchange
> > >> > positions fairly regularly on that question. ;)  It may be worth a dev list
> > >> > discussion to lay out how you imagine an extension of standard fields and
> > >> > how it might look as implemented in Metron.
> > >> >
> > >> > Casey
> > >> >
> > >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> > >> > kylerichardson2@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > I'll second my preference for the first option. I think the ability to
> > >> > > use Stellar filters to customize indexing would be a big win.
> > >> > >
> > >> > > I'm glad Matt brought up the point about data lake and CEP. I think this
> > >> > > is a really important use case that we need to consider. Take a simple
> > >> > > example... If I have data coming in from 3 different firewall vendors
> > >> > > and 2 different web proxy/url filtering vendors and I want to be able to
> > >> > > analyze that data set, I need the data to be indexed all together (likely
> > >> > > in HDFS) and to have a normalized schema such that IP address, URL, and
> > >> > > user name (to take a few) can be easily queried and aggregated. I can also
> > >> > > envision scenarios where I would want to index data based on attributes
> > >> > > other than sensor, business unit or subsidiary for example.
> > >> > >
> > >> > > I've been wanting to propose extending our 7 standard fields to include
> > >> > > things like URL and user. Is there community interest/support for moving
> > >> > > in that direction? If so, I'll start a new thread.
> > >> > >
> > >> > > Thanks!
> > >> > >
> > >> > > -Kyle
> > >> > >
> > >> > > On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <mattf@apache.org> wrote:
> > >> > >
> > >> > > > Ah, I see.  If overriding the default index name allows using the
> > >> > > > same name for multiple sensors, then the goal can be achieved.
> > >> > > > Thanks,
> > >> > > > --Matt
> > >> > > >
> > >> > > >
> > >> > > > On 1/12/17, 3:30 PM, "Casey Stella" <cestella@gmail.com> wrote:
> > >> > > >
> > >> > > >     Oh, you could!  Let's say you have a syslog parser with data from
> > >> > > >     sources 1, 2 and 3.  You'd end up with one kafka queue with 3 parsers
> > >> > > >     attached to that queue, each picking apart the messages from source 1,
> > >> > > >     2 and 3.  They'd go through separate enrichment and into the indexing
> > >> > > >     topology.  In the indexing topology, you could specify the same index
> > >> > > >     name "syslog" and all of the messages go into the same index for CEP
> > >> > > >     querying if so desired.
> > >> > > >
> > >> > > >     On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <mattf@apache.org> wrote:
> > >> > > >
> > >> > > >     > Syslog is hell on parsers – I know, I worked at LogLogic in a
> > >> > > >     > previous life.  It makes perfect sense to route different lines from
> > >> > > >     > syslog through different appropriate parsers.  But a lot of what the
> > >> > > >     > parsers do is identify consistent subsets of metadata and annotate
> > >> > > >     > it – eg, src_ip_addr, event timestamps, etc.  Once those metadata are
> > >> > > >     > annotated and available with common field names, why doesn’t it make
> > >> > > >     > sense to index the messages together, for CEP querying?  I think
> > >> > > >     > Splunk has illustrated this model.
> > >> > > >     >
> > >> > > >     > On 1/12/17, 3:00 PM, "Casey Stella" <cestella@gmail.com> wrote:
> > >> > > >     >
> > >> > > >     >     yeah, I mean, honestly, I think the approach that we've taken for
> > >> > > >     >     sources which aggregate different types of data is to provide
> > >> > > >     >     filters at the parser level and have multiple parser topologies
> > >> > > >     >     (with different, possibly mutually exclusive filters) running.
> > >> > > >     >     This would be a completely separate sensor.  Imagine a syslog data
> > >> > > >     >     source that aggregates and you want to pick apart certain pieces
> > >> > > >     >     of messages.  This is why the initial thought and architecture
> > >> > > >     >     was one index per sensor.
> > >> > > >     >
> > >> > > >     >     On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley <mattf@apache.org> wrote:
> > >> > > >     >     >
> > >> > > >     >     > I’m thinking that CEP (Complex Event Processing) is contrary to
> > >> > > >     >     > the idea of silo-ing data per sensor.
> > >> > > >     >     > Now it’s true that some of those sensors are already aggregating
> > >> > > >     >     > data from multiple sources, so maybe I’m wrong here.
> > >> > > >     >     > But it just seems to me that the “data lake” insights come from
> > >> > > >     >     > being able to make decisions over the whole mass of data rather
> > >> > > >     >     > than just vertical slices of it.
> > >> > > >     >     >
> > >> > > >     >     > On 1/12/17, 2:15 PM, "Casey Stella" <cestella@gmail.com> wrote:
> > >> > > >     >     >
> > >> > > >     >     >     Hey Matt,
> > >> > > >     >     >
> > >> > > >     >     >     Thanks for the comment!
> > >> > > >     >     >     1. At the moment, we only have one index name, the default of
> > >> > > >     >     >     which is the sensor name but that's entirely up to the user.
> > >> > > >     >     >     This is sensor specific, so it'd be a separate config for each
> > >> > > >     >     >     sensor.  If we want to build multiple indices per sensor, we'd
> > >> > > >     >     >     have to think carefully about how to do that and it would be a
> > >> > > >     >     >     bigger undertaking.  I guess I can see the use, though
> > >> > > >     >     >     (redirect messages to one index vs another based on a predicate
> > >> > > >     >     >     for a given sensor).  Anyway, not where I was originally
> > >> > > >     >     >     thinking that this discussion would go, but it's an
> > >> > > >     >     >     interesting point.
> > >> > > >     >     >
> > >> > > >     >     >     2. I hadn't thought through the implementation quite yet, but
> > >> > > >     >     >     we don't actually have a splitter bolt in that topology, just
> > >> > > >     >     >     a spout that goes to the elasticsearch writer and also to the
> > >> > > >     >     >     hdfs writer.
> > >> > > >     >     >
> > >> > > >     >     >     On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley <mattf@apache.org> wrote:
> > >> > > >     >     >
> > >> > > >     >     >     > Casey, good to have controls like this.  Couple questions:
> > >> > > >     >     >     >
> > >> > > >     >     >     > 1. Regarding the “index” : “squid” name/value pair, is the
> > >> > > >     >     >     > index name expected to always be a sensor name?  Or is the
> > >> > > >     >     >     > given json structure subordinate to a sensor name in
> > >> > > >     >     >     > zookeeper?  Or can we build arbitrary indexes with this new
> > >> > > >     >     >     > specification, independent of sensor?  Should there actually
> > >> > > >     >     >     > be a list of “indexes”, ie
> > >> > > >     >     >     > { “indexes” : [
> > >> > > >     >     >     >         {“index” : “name1”,
> > >> > > >     >     >     >                 …
> > >> > > >     >     >     >         },
> > >> > > >     >     >     >         {“index” : “name2”,
> > >> > > >     >     >     >                 …
> > >> > > >     >     >     >         } ]
> > >> > > >     >     >     > }
> > >> > > >     >     >     >
> > >> > > >     >     >     > 2. Would the filtering / writer selection logic take place in
> > >> > > >     >     >     > the indexing topology splitter bolt?  Seems like that would
> > >> > > >     >     >     > have the smallest impact on current implementation, no?
> > >> > > >     >     >     >
> > >> > > >     >     >     > Sorry if these are already answered in PR-415, I haven’t had
> > >> > > >     >     >     > time to review that one yet.
> > >> > > >     >     >     > Thanks,
> > >> > > >     >     >     > --Matt
> > >> > > >     >     >     >
> > >> > > >     >     >     >
> > >> > > >     >     >     > On 1/12/17, 12:55 PM, "Michael Miklavcic" <michael.miklavcic@gmail.com> wrote:
> > >> > > >     >     >     >
> > >> > > >     >     >     >     I like the flexibility and expressibility of the first
> > >> > > >     >     >     >     option with Stellar filters.
> > >> > > >     >     >     >
> > >> > > >     >     >     >     M
> > >> > > >     >     >     >
> > >> > > >     >     >     >     On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <cestella@gmail.com> wrote:
> > >> > > >     >     >     >
> > >> > > >     >     >     >     > As of METRON-652
> > >> > > >     >     >     >     > <https://github.com/apache/incubator-metron/pull/415>, we
> > >> > > >     >     >     >     > will have decoupled the indexing configuration from the
> > >> > > >     >     >     >     > enrichment configuration.  As an immediate follow-up to
> > >> > > >     >     >     >     > that, I'd like to provide the ability to turn off and on
> > >> > > >     >     >     >     > writers via the configs.  I'd like to get some community
> > >> > > >     >     >     >     > feedback on how the functionality should work, if y'all
> > >> > > >     >     >     >     > are amenable. :)
> > >> > > >     >     >     >     >
> > >> > > >     >     >     >     >
> > >> > > >     >     >     >     > As of now, we have 3 possible writers which can be used
> > >> > > >     >     >     >     > in the indexing topology:
> > >> > > >     >     >     >     >
> > >> > > >     >     >     >     >    - Solr
> > >> > > >     >     >     >     >    - Elasticsearch
> > >> > > >     >     >     >     >    - HDFS
> > >> > > >     >     >     >     >
> > >> > > >     >     >     >     > HDFS is always used, elasticsearch or solr is used
> > >> > > >     >     >     >     > depending on how you start the indexing topology.
> > >> > > >     >     >     >     >
> > >> > > >     >     >     >     > A couple of proposals come to mind immediately:
> > >> > > >     >     >     >     >
> > >> > > >     >     >     >     > *Index Filtering*
> > >> > > >     >     >     >     >
> > >> > > >     >     >     >     > You would be able to specify a filter as defined by a
> > >> > > >     >     >     >     > stellar statement (likely a reuse of the StellarFilter
> > >> > > >     >     >     >     > that exists in the Parsers) which would allow you to
> > >> > > >     >     >     >     > indicate on a message-by-message basis whether or not to
> > >> > > >     >     >     >     > write the message.
> > >> > > >     >     >     >     >
> > >> > > >     >     >     >     > The semantics of this would be as follows:
> > >> > > >     >     >     >     >
> > >> > > >     >     >     >     >    - Default (i.e. unspecified) is to pass everything
> > >> > > >     >     >     >     >    through (hence backwards compatible with the current
> > >> > > >     >     >     >     >    default config).
> > >> > > >     >     >     >     >    - Messages which have the associated stellar statement
> > >> > > >     >     >     >     >    evaluate to true for the writer type will be written,
> > >> > > >     >     >     >     >    otherwise not.
> > >> > > >     >     >     >     >
> > >> > > >     >     >     >     > Sample indexing config which would write out no messages
> > >> > > >     >     >     >     > to HDFS and write out to Elasticsearch only messages
> > >> > > >     >     >     >     > containing a field called "field1":
> > >> > > >     >     >     >     > {
> > >> > > >     >     >     >     >    "index" : "squid"
> > >> > > >     >     >     >     >   ,"batchSize" : 100
> > >> > > >     >     >     >     >   ,"filters" : {
> > >> > > >     >     >     >     >       "HDFS" : "false"
> > >> > > >     >     >     >     >      ,"ES" : "exists(field1)"
> > >> > > >     >     >     >     >                  }
> > >> > > >     >     >     >     > }
> > >> > > >     >     >     >     >
> > >> > > >     >     >     >     > *Index On/Off Switch*
> > >> > > >     >     >     >     >
> > >> > > >     >     >     >     > A simpler solution would be to just provide a list of
> > >> > > >     >     >     >     > writers to write messages.  The semantics would be as
> > >> > > >     >     >     >     > follows:
> > >> > > >     >     >     >     >
> > >> > > >     >     >     >     >    - If the list is unspecified, then the default is to
> > >> > > >     >     >     >     >    write all messages for every writer in the indexing
> > >> > > >     >     >     >     >    topology
> > >> > > >     >     >     >     >    - If the list is specified, then a writer will write
> > >> > > >     >     >     >     >    all messages if and only if it is named in the list.
> > >> > > >     >     >     >     > Sample indexing config which turns off HDFS and keeps on
> > >> > > >     >     >     >     > Elasticsearch:
> > >> > > >     >     >     >     > {
> > >> > > >     >     >     >     >    "index" : "squid"
> > >> > > >     >     >     >     >   ,"batchSize" : 100
> > >> > > >     >     >     >     >   ,"writers" : [ "ES" ]
> > >> > > >     >     >     >     > }
> > >> > > >     >     >     >     >
> > >> > > >     >     >     >     > Thanks in advance for the feedback!  Also, if you have
> > >> > > >     >     >     >     > any other, better ideas than the ones presented here,
> > >> > > >     >     >     >     > let me know too.
> > >> > > >     >     >     >     >
> > >> > > >     >     >     >     > Best,
> > >> > > >     >     >     >     >
> > >> > > >     >     >     >     > Casey
> > >> > > >     >     >     >     >
> > >> --
> > >> Nick Allen <nick@nickallen.org>
> > >>
> >
>



-- 
Nick Allen <nick@nickallen.org>
