metron-dev mailing list archives

From Nick Allen <n...@nickallen.org>
Subject Re: [DISCUSS] Turning off indexing writers feature discussion
Date Fri, 13 Jan 2017 15:34:28 GMT
I was thinking there would only be one 'when' for each output.  So if we
have Elasticsearch and HDFS, you would have only 2 'when's.  Each when
statement could be as simple or complex as you need.
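
A minimal sketch of how one 'when' per output could be resolved against the top-level defaults quoted below. This is only an illustration, not Metron code: the function names `resolve_writer_config` and `should_write` are hypothetical, and `should_write` is a toy stand-in for Stellar that assumes only `true`, `false` and `exists(field)` are ever used.

```python
import re


def resolve_writer_config(sensor, config, writer):
    """Merge top-level defaults with the writer-specific block, if any."""
    defaults = {
        "index": config.get("index", sensor),     # default index is the sensor name
        "batchSize": config.get("batchSize", 1),  # default batch size of 1
        "when": "true",                           # no filter means write everything
    }
    overrides = config.get("writerConfig", {}).get(writer, {})
    return {**defaults, **overrides}


_EXISTS = re.compile(r"^exists\((\w+)\)$")


def should_write(when, message):
    """Toy stand-in for Stellar: handles only true, false and exists(field)."""
    expr = when.strip()
    if expr in ("true", "false"):
        return expr == "true"
    match = _EXISTS.match(expr)
    if match is None:
        raise ValueError(f"unsupported expression: {when!r}")
    return match.group(1) in message


# The "writer-specific case with filters" config from the thread.
config = {
    "index": "foo",
    "batchSize": 1,
    "writerConfig": {
        "elasticsearch": {"batchSize": 100, "when": "exists(field1)"},
        "hdfs": {"when": "false"},
    },
}

for writer in ("elasticsearch", "hdfs"):
    resolved = resolve_writer_config("squid", config, writer)
    print(writer, resolved, should_write(resolved["when"], {"field1": "x"}))
```

With two outputs there are exactly two resolved 'when' expressions, however complex each one is.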

On Fri, Jan 13, 2017 at 10:08 AM, Otto Fowler <ottobackwards@gmail.com>
wrote:

> How does it look with 50 whens?
>
>
> On January 13, 2017 at 10:02:02, Casey Stella (cestella@gmail.com) wrote:
>
> Ok, so here's what I'm thinking based on the discussion:
>
> - Keeping the configs that we have now (batchSize and index) as defaults
> for the unspecified writer-specific case
> - Adding the config Nick suggested
>
> *Base Case*:
> {
> }
>
> - all writers write all messages
> - index named the same as the sensor for all writers
> - batchSize of 1 for all writers
>
> *Writer-non-specific case*:
> {
> "index" : "foo"
> ,"batchSize" : 100
> }
>
> - All writers write all messages
> - index is named "foo", different from the sensor for all writers
> - batchSize is 100 for all writers
>
> *Writer-specific case without filters*
> {
> "index" : "foo"
> ,"batchSize" : 1
> , "writerConfig" :
> {
> "elasticsearch" : {
> "batchSize" : 100
> }
> }
> }
>
> - All writers write all messages
> - index is named "foo", different from the sensor for all writers
> - batchSize is 1 for HDFS and 100 for elasticsearch writers
> - NOTE: I could override the index name too
>
> *Writer-specific case with filters*
> {
> "index" : "foo"
> ,"batchSize" : 1
> , "writerConfig" :
> {
> "elasticsearch" : {
> "batchSize" : 100,
> "when" : "exists(field1)"
> },
> "hdfs" : {
> "when" : "false"
> }
> }
> }
>
> - ES writer writes messages which have field1, HDFS writes nothing
> - index is named "foo", different from the sensor for all writers
> - batchSize is 100 for elasticsearch writers
>
> Thoughts?
>
> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby <cduby@hortonworks.com>
> wrote:
>
> > For larger installations you need to control what is indexed so you don’t
> > end up with a nasty Elasticsearch situation and so you can mine the data
> > later for reports and training ML models.
> >
> > Thanks
> > Carolyn
> >
> >
> >
> >
> > On 1/13/17, 9:40 AM, "Casey Stella" <cestella@gmail.com> wrote:
> >
> > >OH that's a good idea!
> > >
> > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen <nick@nickallen.org> wrote:
> > >
> > >> I like the "Index Filtering" option based on the flexibility that it
> > >> provides. Should each output (HDFS, ES, etc) have its own configuration
> > >> settings? For example, aren't things like batching handled separately for
> > >> HDFS versus Elasticsearch?
> > >>
> > >> Something along the lines of...
> > >>
> > >> {
> > >> "hdfs" : {
> > >> "when": "exists(field1)",
> > >> "batchSize": 100
> > >> },
> > >>
> > >> "elasticsearch" : {
> > >> "when": "true",
> > >> "batchSize": 1000,
> > >> "index": "squid"
> > >> }
> > >> }
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella <cestella@gmail.com> wrote:
> > >>
> > >> > Yeah, I tend to like the first option too. Any opposition to that from
> > >> > anyone?
> > >> >
> > >> > The points brought up are good ones and I think that it may be worth a
> > >> > broader discussion of the requirements of indexing in a separate dev list
> > >> > thread. Maybe a list of desires with coherent use-cases justifying them so
> > >> > we can think about how this stuff should work and where the natural
> > >> > extension points should be. After all, we need to toe the line between
> > >> > engineering and overengineering for features nobody will want.
> > >> >
> > >> > I'm not sure about the extensions to the standard fields. I'm torn between
> > >> > the notions that we should have no standard fields vs we should have a
> > >> > boatload of standard fields (with most of them empty). I exchange
> > >> > positions fairly regularly on that question. ;) It may be worth a dev list
> > >> > discussion to lay out how you imagine an extension of standard fields and
> > >> > how it might look as implemented in Metron.
> > >> >
> > >> > Casey
> > >> >
> > >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <kylerichardson2@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > I'll second my preference for the first option. I think the ability to use
> > >> > > Stellar filters to customize indexing would be a big win.
> > >> > >
> > >> > > I'm glad Matt brought up the point about data lake and CEP. I think this is
> > >> > > a really important use case that we need to consider. Take a simple
> > >> > > example... If I have data coming in from 3 different firewall vendors and 2
> > >> > > different web proxy/url filtering vendors and I want to be able to analyze
> > >> > > that data set, I need the data to be indexed all together (likely in HDFS)
> > >> > > and to have a normalized schema such that IP address, URL, and user name
> > >> > > (to take a few) can be easily queried and aggregated. I can also envision
> > >> > > scenarios where I would want to index data based on attributes other than
> > >> > > sensor, business unit or subsidiary for example.
> > >> > >
> > >> > > I've been wanting to propose extending our 7 standard fields to include
> > >> > > things like URL and user. Is there community interest/support for moving in
> > >> > > that direction? If so, I'll start a new thread.
> > >> > >
> > >> > > Thanks!
> > >> > >
> > >> > > -Kyle
> > >> > >
> > >> > > On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <mattf@apache.org> wrote:
> > >> > >
> > >> > > > Ah, I see. If overriding the default index name allows using the same
> > >> > > > name for multiple sensors, then the goal can be achieved.
> > >> > > > Thanks,
> > >> > > > --Matt
> > >> > > >
> > >> > > >
> > >> > > > On 1/12/17, 3:30 PM, "Casey Stella" <cestella@gmail.com> wrote:
> > >> > > >
> > >> > > > Oh, you could! Let's say you have a syslog parser with data from sources 1,
> > >> > > > 2 and 3. You'd end up with one kafka queue with 3 parsers attached to that
> > >> > > > queue, each picking apart the messages from source 1, 2 and 3. They'd go
> > >> > > > through separate enrichment and into the indexing topology. In the
> > >> > > > indexing topology, you could specify the same index name "syslog" and all
> > >> > > > of the messages go into the same index for CEP querying if so desired.
> > >> > > >
> > >> > > > On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <mattf@apache.org> wrote:
> > >> > > >
> > >> > > > > Syslog is hell on parsers – I know, I worked at LogLogic in a previous
> > >> > > > > life. It makes perfect sense to route different lines from syslog through
> > >> > > > > different appropriate parsers. But a lot of what the parsers do is
> > >> > > > > identify consistent subsets of metadata and annotate it – eg, src_ip_addr,
> > >> > > > > event timestamps, etc. Once those metadata are annotated and available
> > >> > > > > with common field names, why doesn’t it make sense to index the messages
> > >> > > > > together, for CEP querying? I think Splunk has illustrated this model.
> > >> > > > >
> > >> > > > > On 1/12/17, 3:00 PM, "Casey Stella" <cestella@gmail.com> wrote:
> > >> > > > >
> > >> > > > > yeah, I mean, honestly, I think the approach that we've taken for sources
> > >> > > > > which aggregate different types of data is to provide filters at the parser
> > >> > > > > level and have multiple parser topologies (with different, possibly
> > >> > > > > mutually exclusive filters) running. This would be a completely separate
> > >> > > > > sensor. Imagine a syslog data source that aggregates and you want to pick
> > >> > > > > apart certain pieces of messages. This is why the initial thought and
> > >> > > > > architecture was one index per sensor.
> > >> > > > >
> > >> > > > > On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley <mattf@apache.org> wrote:
> > >> > > > >
> > >> > > > > > I’m thinking that CEP (Complex Event Processing) is contrary to the idea
> > >> > > > > > of silo-ing data per sensor.
> > >> > > > > > Now it’s true that some of those sensors are already aggregating data from
> > >> > > > > > multiple sources, so maybe I’m wrong here.
> > >> > > > > > But it just seems to me that the “data lake” insights come from being able
> > >> > > > > > to make decisions over the whole mass of data rather than just vertical
> > >> > > > > > slices of it.
> > >> > > > > >
> > >> > > > > > On 1/12/17, 2:15 PM, "Casey Stella" <cestella@gmail.com> wrote:
> > >> > > > > >
> > >> > > > > > Hey Matt,
> > >> > > > > >
> > >> > > > > > Thanks for the comment!
> > >> > > > > > 1. At the moment, we only have one index name, the default of which is the
> > >> > > > > > sensor name but that's entirely up to the user. This is sensor specific,
> > >> > > > > > so it'd be a separate config for each sensor. If we want to build multiple
> > >> > > > > > indices per sensor, we'd have to think carefully about how to do that and
> > >> > > > > > it would be a bigger undertaking. I guess I can see the use, though (redirect
> > >> > > > > > messages to one index vs another based on a predicate for a given sensor).
> > >> > > > > > Anyway, not where I was originally thinking that this discussion would go,
> > >> > > > > > but it's an interesting point.
> > >> > > > > >
> > >> > > > > > 2. I hadn't thought through the implementation quite yet, but we don't
> > >> > > > > > actually have a splitter bolt in that topology, just a spout that goes to
> > >> > > > > > the elasticsearch writer and also to the hdfs writer.
> > >> > > > > >
> > >> > > > > > On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley <mattf@apache.org> wrote:
> > >> > > > > >
> > >> > > > > > > Casey, good to have controls like this. Couple questions:
> > >> > > > > > >
> > >> > > > > > > 1. Regarding the “index” : “squid” name/value pair, is the index name
> > >> > > > > > > expected to always be a sensor name? Or is the given json structure
> > >> > > > > > > subordinate to a sensor name in zookeeper? Or can we build arbitrary
> > >> > > > > > > indexes with this new specification, independent of sensor? Should there
> > >> > > > > > > actually be a list of “indexes”, ie
> > >> > > > > > > { “indexes” : [
> > >> > > > > > > {“index” : “name1”,
> > >> > > > > > > …
> > >> > > > > > > },
> > >> > > > > > > {“index” : “name2”,
> > >> > > > > > > …
> > >> > > > > > > } ]
> > >> > > > > > > }
> > >> > > > > > >
> > >> > > > > > > 2. Would the filtering / writer selection logic take place in the indexing
> > >> > > > > > > topology splitter bolt? Seems like that would have the smallest impact on
> > >> > > > > > > current implementation, no?
> > >> > > > > > >
> > >> > > > > > > Sorry if these are already answered in PR-415, I haven’t had time to
> > >> > > > > > > review that one yet.
> > >> > > > > > > Thanks,
> > >> > > > > > > --Matt
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > On 1/12/17, 12:55 PM, "Michael Miklavcic" <michael.miklavcic@gmail.com>
> > >> > > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > I like the flexibility and expressibility of the first option with Stellar
> > >> > > > > > > filters.
> > >> > > > > > >
> > >> > > > > > > M
> > >> > > > > > >
> > >> > > > > > > On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <cestella@gmail.com> wrote:
> > >> > > > > > >
> > >> > > > > > > > As of METRON-652 <https://github.com/apache/incubator-metron/pull/415>, we
> > >> > > > > > > > will have decoupled the indexing configuration from the enrichment
> > >> > > > > > > > configuration. As an immediate follow-up to that, I'd like to provide the
> > >> > > > > > > > ability to turn off and on writers via the configs. I'd like to get some
> > >> > > > > > > > community feedback on how the functionality should work, if y'all are
> > >> > > > > > > > amenable. :)
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > As of now, we have 3 possible writers which can be used in the indexing
> > >> > > > > > > > topology:
> > >> > > > > > > >
> > >> > > > > > > > - Solr
> > >> > > > > > > > - Elasticsearch
> > >> > > > > > > > - HDFS
> > >> > > > > > > >
> > >> > > > > > > > HDFS is always used, elasticsearch or solr is used depending on how you
> > >> > > > > > > > start the indexing topology.
> > >> > > > > > > >
> > >> > > > > > > > A couple of proposals come to mind immediately:
> > >> > > > > > > >
> > >> > > > > > > > *Index Filtering*
> > >> > > > > > > >
> > >> > > > > > > > You would be able to specify a filter as defined by a stellar statement
> > >> > > > > > > > (likely a reuse of the StellarFilter that exists in the Parsers) which
> > >> > > > > > > > would allow you to indicate on a message-by-message basis whether or not to
> > >> > > > > > > > write the message.
> > >> > > > > > > >
> > >> > > > > > > > The semantics of this would be as follows:
> > >> > > > > > > >
> > >> > > > > > > > - Default (i.e. unspecified) is to pass everything through (hence
> > >> > > > > > > > backwards compatible with the current default config).
> > >> > > > > > > > - Messages which have the associated stellar statement evaluate to true
> > >> > > > > > > > for the writer type will be written, otherwise not.
> > >> > > > > > > >
> > >> > > > > > > > Sample indexing config which would write out no messages to HDFS and write
> > >> > > > > > > > out to Elasticsearch only messages containing a field called "field1":
> > >> > > > > > > > {
> > >> > > > > > > > "index" : "squid"
> > >> > > > > > > > ,"batchSize" : 100
> > >> > > > > > > > ,"filters" : {
> > >> > > > > > > > "HDFS" : "false"
> > >> > > > > > > > ,"ES" : "exists(field1)"
> > >> > > > > > > > }
> > >> > > > > > > > }
> > >> > > > > > > >
> > >> > > > > > > > *Index On/Off Switch*
> > >> > > > > > > >
> > >> > > > > > > > A simpler solution would be to just provide a list of writers to write
> > >> > > > > > > > messages. The semantics would be as follows:
> > >> > > > > > > >
> > >> > > > > > > > - If the list is unspecified, then the default is to write all messages
> > >> > > > > > > > for every writer in the indexing topology
> > >> > > > > > > > - If the list is specified, then a writer will write all messages if and
> > >> > > > > > > > only if it is named in the list.
> > >> > > > > > > >
> > >> > > > > > > > Sample indexing config which turns off HDFS and keeps on Elasticsearch:
> > >> > > > > > > > {
> > >> > > > > > > > "index" : "squid"
> > >> > > > > > > > ,"batchSize" : 100
> > >> > > > > > > > ,"writers" : [ "ES" ]
> > >> > > > > > > > }
> > >> > > > > > > >
> > >> > > > > > > > Thanks in advance for the feedback! Also, if you have any other, better
> > >> > > > > > > > ideas than the ones presented here, let me know too.
> > >> > > > > > > >
> > >> > > > > > > > Best,
> > >> > > > > > > >
> > >> > > > > > > > Casey
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Nick Allen <nick@nickallen.org>
> > >>
> >
>



-- 
Nick Allen <nick@nickallen.org>
