metron-dev mailing list archives

From Casey Stella <ceste...@gmail.com>
Subject Re: [DISCUSS] Turning off indexing writers feature discussion
Date Mon, 16 Jan 2017 15:46:21 GMT
Well, I like it for a couple of reasons:

   - It's explicit and clear that the writer is on or off
   - It enables people to keep their writer config in the file without
   having the writer on (so I don't have to adjust the when clause to "false")
   - It enables us to not have to execute a stellar statement for "off"
   writers.
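
To make the last point concrete, here is a minimal sketch (in Python, not
Metron code) of how a writer could check 'enabled' before touching 'when'.
The names should_write and toy_evaluate are hypothetical, the toy evaluator
only stands in for Stellar, and treating a missing config as "on" is my
reading of the default-on semantics discussed in this thread:

```python
import re
from typing import Any, Callable, Dict, Optional

# Hypothetical signature for a predicate evaluator: (expression, message) -> bool.
Predicate = Callable[[str, Dict[str, Any]], bool]


def toy_evaluate(expr: str, message: Dict[str, Any]) -> bool:
    """Toy stand-in for Stellar: understands only true, false, exists(field)."""
    expr = expr.strip()
    if expr == "true":
        return True
    if expr == "false":
        return False
    match = re.fullmatch(r"exists\((\w+)\)", expr)
    if match:
        return match.group(1) in message
    raise ValueError("unsupported expression: " + expr)


def should_write(writer_config: Optional[Dict[str, Any]],
                 message: Dict[str, Any],
                 evaluate: Predicate = toy_evaluate) -> bool:
    """Decide whether one writer should handle one message."""
    # Assumption: an unspecified writer config means "on", as today.
    if writer_config is None:
        return True
    enabled = writer_config.get("enabled", True)
    # The examples in this thread quote the flag as a string, so accept both.
    if isinstance(enabled, str):
        enabled = enabled.strip().lower() == "true"
    if not enabled:
        # Short-circuit: the 'when' expression is never evaluated.
        return False
    when = writer_config.get("when")
    if when is None:
        return True
    return evaluate(when, message)


# Example config: hdfs keeps its settings but is switched off.
# The path value is a placeholder, not a real Metron default.
example_config = {
    "elasticsearch": {"enabled": "true", "index": "foo",
                      "batchSize": 100, "when": "exists(field1)"},
    "hdfs": {"enabled": "false", "path": "/tmp/example",
             "batchSize": 100, "when": "true"},
}
```

With this shape, the hdfs entry can stay in the file untouched while it is
off, and turning it back on is a one-word change.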



On Mon, Jan 16, 2017 at 10:40 AM, Nick Allen <nick@nickallen.org> wrote:

> I'm all for a compromise here.  Sounds like we're getting close.
>
> Just one thing.  Can you lay out the reasoning for having both 'enabled' and
> 'when'?  I don't follow it, but maybe I am missing something.
>
> On Sat, Jan 14, 2017 at 12:13 PM, Kyle Richardson <
> kylerichardson2@gmail.com
> > wrote:
>
> > I'm +1 on the current proposal. I like Nick's syntax and agree with Jon's
> > enabled property. I also like the idea of a path property for HDFS.
> >
> > -Kyle
> >
> > > On Jan 14, 2017, at 10:51 AM, Casey Stella <cestella@gmail.com> wrote:
> > >
> > > I'm +1 on an explicit enabled property and a filter (or when)
> property. I
> > > think we are zeroing in on a decent design, so that is good.
> > >
> > > To recap, what I am +1 on is Nick's proposed syntax with the following
> > > modifications:
> > > 1. An explicit enabled field
> > > 2. A default on for unspecified to match current semantics
> > >
> > > Casey
> > >> On Sat, Jan 14, 2017 at 10:45 Zeolla@GMail.com <zeolla@gmail.com>
> > wrote:
> > >>
> > >> This has the additional benefit of doing something like below when you
> > want
> > >> to temporarily disable the hdfs writer, but don't want to remove the
> > >> settings.  This removes the need to store the path and batchSize (and
> > many
> > >> additional settings) somewhere else so they can be brought back in
> when
> > you
> > >> want to re-enable it, which is a nice workflow attribute for the end
> > user:
> > >>
> > >> {
> > >>   'elasticsearch': {
> > >>      'enabled': 'true',
> > >>      'index': 'foo',
> > >>      'batchSize': 100
> > >>    },
> > >>   'hdfs': {
> > >>      'enabled': 'false',
> > >>      'path': '/foo/bar/...',
> > >>      'batchSize': 100
> > >>    },
> > >>   'solr': {
> > >>      'enabled': 'false'
> > >>    }
> > >> }
> > >>
> > >> Jon
> > >>
> > >>> On Sat, Jan 14, 2017 at 9:24 AM Zeolla@GMail.com <zeolla@gmail.com>
> > wrote:
> > >>>
> > >>> I similarly have a concern there because I prefer being as explicit
> as
> > >>> possible, which makes things easier to pick up for new users.  Using
> my
> > >>> example from earlier this could look like specifying while(false),
> but
> > an
> > >>> even better and more obvious approach may be to use enabled(false).
> So
> > >> the
> > >>> current simple default would be:
> > >>>
> > >>> {
> > >>>   'elasticsearch': { 'enabled': 'true' },
> > >>>   'hdfs': { 'enabled': 'true' },
> > >>>   'solr': { 'enabled': 'false' }
> > >>> }
> > >>>
> > >>> And to use ES with some overrides but not HDFS or solr it would look
> > >> like:
> > >>>
> > >>> {
> > >>>   'elasticsearch': {
> > >>>      'enabled': 'true',
> > >>>      'index': 'foo',
> > >>>      'batchSize': 100
> > >>>    },
> > >>>   'hdfs': {
> > >>>      'enabled': 'false'
> > >>>    },
> > >>>   'solr': {
> > >>>      'enabled': 'false'
> > >>>    }
> > >>> }
> > >>>
> > >>> Jon
> > >>>
> > >>> On Fri, Jan 13, 2017 at 10:21 PM Casey Stella <cestella@gmail.com>
> > >> wrote:
> > >>>
> > >>> One thing that I thought of that I very strenuously do not like in
> Nick's
> > >>> proposal is that if a writer config is not specified then it is
> turned
> > >> off
> > >>> (I think; if I misunderstood let me know). In the situation where we
> > >> have a
> > >>> new sensor, right now if there are no index config and no enrichment
> > >>> config, it still passes through to the index using defaults. In this
> > new
> > >>> scheme it would not. This changes the default semantics for the
> system
> > >> and
> > >>> I think it changes it for the worse.
> > >>>
> > >>> I would strongly prefer an on-by-default indexing config as we have
> now.
> > >>>> On Fri, Jan 13, 2017 at 17:13 Casey Stella <cestella@gmail.com>
> > wrote:
> > >>>>
> > >>>> One thing that I really like about Nick's suggestion is that it
> allows
> > >>>> writer-specific configs in a clear and simple way.  It is more
> complex
> > >>> for
> > >>>> the default case (all writers write to indices named the same thing
> > >> with
> > >>> a
> > >>>> fixed batch size), which I do not like, but maybe it's worth the
> > >>> compromise
> > >>>> to make it less complex for the advanced case.
> > >>>>
> > >>>> Thanks a lot for the suggestion, Nick, it's interesting;  I'm
> > beginning
> > >>> to
> > >>>> lean your way.
> > >>>>
> > >>>> On Fri, Jan 13, 2017 at 2:51 PM, Zeolla@GMail.com <zeolla@gmail.com
> >
> > >>>> wrote:
> > >>>>
> > >>>> I like the suggestions you made, Nick.  The only thing I would add
> is
> > >>> that
> > >>>> it's also nice to see an explicit when(false), as people newer to
> the
> > >>>> platform may not know where to expect configs for the different
> > >> writers.
> > >>>> Being able to do it either way, which I think is already assumed in
> > >> your
> > >>>> model, would make sense.  I would just suggest that, if we support
> but
> > >>> are
> > >>>> disabling a writer, that the platform inserts a default when(false)
> to
> > >> be
> > >>>> explicit.
> > >>>>
> > >>>> Jon
> > >>>>
> > >>>> On Fri, Jan 13, 2017 at 11:59 AM Casey Stella <cestella@gmail.com>
> > >>> wrote:
> > >>>>
> > >>>>> Let me noodle on this over the weekend.  Your syntax is looking
> less
> > >>>>> onerous to me and I like the following statement from Otto: "In the
> > >>> end,
> > >>>>> each write destination ‘type’ will need its own configuration.
> This
> > >>> is
> > >>>> an
> > >>>>> extension point."
> > >>>>>
> > >>>>> I may come around to your way of thinking.
> > >>>>>
> > >>>>> On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler <
> > >> ottobackwards@gmail.com
> > >>>>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> In the end, each write destination ‘type’ will need its own
> > >>>>>> configuration.  This is an extension point.
> > >>>>>> {
> > >>>>>> HDFS:{
> > >>>>>> outputAdapters:[
> > >>>>>> {name: avro,
> > >>>>>> settings:{
> > >>>>>> avro stuff….
> > >>>>>> when:{
> > >>>>>> },
> > >>>>>> {
> > >>>>>> name: sequence file,
> > >>>>>> …..
> > >>>>>>
> > >>>>>> or some such.
> > >>>>>>
> > >>>>>>
> > >>>>>> On January 13, 2017 at 11:51:15, Nick Allen (nick@nickallen.org)
> > >>>> wrote:
> > >>>>>>
> > >>>>>> I will add also that instead of global overrides, like index, we
> > >>> should
> > >>>>> use
> > >>>>>> configuration key names that are more appropriate to the output.
> > >>>>>>
> > >>>>>> For example, does 'index' really make sense for HDFS? Or would
> > >> 'path'
> > >>>> be
> > >>>>>> more appropriate?
> > >>>>>>
> > >>>>>> {
> > >>>>>> 'elasticsearch': {
> > >>>>>> 'index': 'foo',
> > >>>>>> 'batchSize': 1
> > >>>>>> },
> > >>>>>> 'hdfs': {
> > >>>>>> 'path': '/foo/bar/...',
> > >>>>>> 'batchSize': 100
> > >>>>>> }
> > >>>>>> }
> > >>>>>>
> > >>>>>> Ok, I've said my piece. Thanks for the effort in summarizing all
> > >>> this,
> > >>>>>> Casey.
> > >>>>>>
> > >>>>>>
> > >>>>>> On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen <nick@nickallen.org>
> > >>>> wrote:
> > >>>>>>
> > >>>>>>> Nick's concerns about my suggestion were that it was overly
> > >> complex
> > >>>> and
> > >>>>>>>> hard to grok and that we could dispense with backwards
> > >>> compatibility
> > >>>>> and
> > >>>>>>>> make people do a bit more work on the default case for the
> > >>> benefits
> > >>>>> of a
> > >>>>>>>> simpler advanced case. (Nick, make sure I don't misstate your
> > >>>>> position)
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> I will add that in my mind, the majority case would be a user
> > >>>>>>> specifying the outputs, but not things like 'batchSize' or
> > >> 'when'.
> > >>> I
> > >>>>>> think
> > >>>>>>> in the majority case, the user would accept whatever the default
> > >>>> batch
> > >>>>>> size
> > >>>>>>> is.
> > >>>>>>>
> > >>>>>>> Here are alternative suggestions for all the examples that you
> > >>>>> provided
> > >>>>>>> previously.
> > >>>>>>>
> > >>>>>>> Base Case
> > >>>>>>>
> > >>>>>>> - The user must always specify the 'outputs' for clarity.
> > >>>>>>> - Uses default index name, batch size and when = true.
> > >>>>>>>
> > >>>>>>> {
> > >>>>>>> 'elasticsearch': {},
> > >>>>>>> 'hdfs': {}
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Writer-non-specific Case
> > >>>>>>> <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-non-specific-case>
> > >>>>>>>
> > >>>>>>> - There are no global overrides, as in Casey's proposal.
> > >>>>>>> - Easier to grok IMO.
> > >>>>>>>
> > >>>>>>> {
> > >>>>>>> 'elasticsearch': {
> > >>>>>>> 'index': 'foo',
> > >>>>>>> 'batchSize': 100
> > >>>>>>> },
> > >>>>>>> 'hdfs': {
> > >>>>>>> 'index': 'foo',
> > >>>>>>> 'batchSize': 100
> > >>>>>>> }
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Writer-specific case without filters
> > >>>>>>> <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-without-filters>
> > >>>>>>>
> > >>>>>>> {
> > >>>>>>> 'elasticsearch': {
> > >>>>>>> 'index': 'foo',
> > >>>>>>> 'batchSize': 1
> > >>>>>>> },
> > >>>>>>> 'hdfs': {
> > >>>>>>> 'index': 'foo',
> > >>>>>>> 'batchSize': 100
> > >>>>>>> }
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Writer-specific case with filters
> > >>>>>>> <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-with-filters>
> > >>>>>>>
> > >>>>>>> - Instead of having to say when=false, just don't configure HDFS
> > >>>>>>>
> > >>>>>>> {
> > >>>>>>> 'elasticsearch': {
> > >>>>>>> 'index': 'foo',
> > >>>>>>> 'batchSize': 100,
> > >>>>>>> 'when': 'exists(field1)'
> > >>>>>>> }
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella <
> > >> cestella@gmail.com
> > >>>>
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Dave,
> > >>>>>>>> For the benefit of posterity and people who might not be as
> > >> deeply
> > >>>>>>>> entangled in the system as we have been, I'll recap things and
> > >>>>> hopefully
> > >>>>>>>> answer your question in the process.
> > >>>>>>>>
> > >>>>>>>> Historically the index configuration is split between the
> > >>> enrichment
> > >>>>>>>> configs and the global configs.
> > >>>>>>>>
> > >>>>>>>> - The global configs really control configs that apply to all
> > >>>>> sensors.
> > >>>>>>>> Historically this has been stuff like index connection strings,
> > >>> etc.
> > >>>>>>>> - The sensor-specific configs which control things that vary by
> > >>>>> sensor.
> > >>>>>>>>
> > >>>>>>>> As of Metron-652 (in review currently), we moved the sensor
> > >>> specific
> > >>>>>>>> configs from the enrichment configs. The proposal here is to
> > >>>> increase
> > >>>>>> the
> > >>>>>>>> granularity of the sensor specific files to make them
> > >> support
> > >>>>> index
> > >>>>>>>> writer-specific configs. Right now in the indexing topology, we
> > >>>> have 2
> > >>>>>>>> writers (fixed): ES/Solr and HDFS.
> > >>>>>>>>
> > >>>>>>>> The proposed configuration would allow you to either specify a
> > >>>> blanket
> > >>>>>>>> sensor-level config for the index name and batchSize and/or
> > >>> override
> > >>>>> at
> > >>>>>>>> the
> > >>>>>>>> writer level, thereby supporting a couple of use-cases:
> > >>>>>>>>
> > >>>>>>>> - Turning off certain index writers (e.g. HDFS)
> > >>>>>>>> - Filtering the messages written to certain index writers
> > >>>>>>>>
> > >>>>>>>> The two competing configs between Nick and me are as follows:
> > >>>>>>>>
> > >>>>>>>> - I want to make sure we keep the old sensor-specific defaults
> > >>> with
> > >>>>>>>> writer-specific overrides available
> > >>>>>>>> - Nick thought we could simplify the permutations by making the
> > >>>>>>>> indexing
> > >>>>>>>> config only the writer-level configs.
> > >>>>>>>>
> > >>>>>>>> My concerns about Nick's suggestion were that the default and
> > >>>> majority
> > >>>>>>>> case, specifying the index and the batchSize for all writers (the one we
> > >>>>>>>> support now) would require more configuration.
> > >>>>>>>>
> > >>>>>>>> Nick's concerns about my suggestion were that it was overly
> > >>> complex
> > >>>>> and
> > >>>>>>>> hard to grok and that we could dispense with backwards
> > >>> compatibility
> > >>>>> and
> > >>>>>>>> make people do a bit more work on the default case for the
> > >>> benefits
> > >>>>> of a
> > >>>>>>>> simpler advanced case. (Nick, make sure I don't misstate your
> > >>>>> position).
> > >>>>>>>>
> > >>>>>>>> Casey
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Fri, Jan 13, 2017 at 10:54 AM, David Lyle <
> > >>> dlyle65535@gmail.com>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Casey,
> > >>>>>>>>>
> > >>>>>>>>> Can you give me a level set of what your thinking is now? I
> > >>> think
> > >>>>> it's
> > >>>>>>>>> global control of all index types + overrides on a per-type
> > >>> basis.
> > >>>>>> Fwiw,
> > >>>>>>>>> I'm totally for that, but I want to make sure I'm not imposing
> > >>> my
> > >>>>>>>>> preconceived notions on your consensus-driven ones.
> > >>>>>>>>>
> > >>>>>>>>> -D....
> > >>>>>>>>>
> > >>>>>>>>> On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella <
> > >>>> cestella@gmail.com>
> > >>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> I am suggesting that, yes. The configs are essentially the
> > >>> same
> > >>>> as
> > >>>>>>>>> yours,
> > >>>>>>>>>> except there is an override specified at the top level.
> > >>> Without
> > >>>>>>>> that, in
> > >>>>>>>>>> order to specify both HDFS and ES have batch sizes of 100,
> > >> you
> > >>>>> have
> > >>>>>> to
> > >>>>>>>>>> explicitly configure each. It's less that I'm trying to have
> > >>>>>>>> backwards
> > >>>>>>>>>> compatibility and more that I'm trying to make the majority
> > >>> case
> > >>>>>> easy:
> > >>>>>>>>> both
> > >>>>>>>>>> writers write everything to a specified index name with a
> > >>>>> specified
> > >>>>>>>> batch
> > >>>>>>>>>> size (which is what we have now). Beyond that, I want to
> > >> allow
> > >>>> for
> > >>>>>>>>>> specifying an override for the config on a writer-by-writer
> > >>>> basis
> > >>>>>> for
> > >>>>>>>>> those
> > >>>>>>>>>> who need it.
> > >>>>>>>>>>
> > >>>>>>>>>> On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen <
> > >>>> nick@nickallen.org>
> > >>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Are you saying we support all of these variants? I realize
> > >>> you
> > >>>>> are
> > >>>>>>>>>> trying
> > >>>>>>>>>>> to have some backwards compatibility, but this also makes
> > >> it
> > >>>>>> harder
> > >>>>>>>>> for a
> > >>>>>>>>>>> user to grok (for me at least).
> > >>>>>>>>>>>
> > >>>>>>>>>>> Personally I like my original example as there are fewer
> > >>>>>>>>> sub-structures,
> > >>>>>>>>>>> like 'writerConfig', which makes the whole thing simpler
> > >> and
> > >>>>>> easier
> > >>>>>>>> to
> > >>>>>>>>>>> grok. But maybe others will think your proposal is just as
> > >>>> easy
> > >>>>> to
> > >>>>>>>>> grok.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella <
> > >>>>>> cestella@gmail.com>
> > >>>>>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Ok, so here's what I'm thinking based on the discussion:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - Keeping the configs that we have now (batchSize and
> > >>> index)
> > >>>>> as
> > >>>>>>>>>>> defaults
> > >>>>>>>>>>>> for the unspecified writer-specific case
> > >>>>>>>>>>>> - Adding the config Nick suggested
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> *Base Case*:
> > >>>>>>>>>>>> {
> > >>>>>>>>>>>> }
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - all writers write all messages
> > >>>>>>>>>>>> - index named the same as the sensor for all writers
> > >>>>>>>>>>>> - batchSize of 1 for all writers
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> *Writer-non-specific case*:
> > >>>>>>>>>>>> {
> > >>>>>>>>>>>> "index" : "foo"
> > >>>>>>>>>>>> ,"batchSize" : 100
> > >>>>>>>>>>>> }
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - All writers write all messages
> > >>>>>>>>>>>> - index is named "foo", different from the sensor for
> > >> all
> > >>>>>>>> writers
> > >>>>>>>>>>>> - batchSize is 100 for all writers
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> *Writer-specific case without filters*
> > >>>>>>>>>>>> {
> > >>>>>>>>>>>> "index" : "foo"
> > >>>>>>>>>>>> ,"batchSize" : 1
> > >>>>>>>>>>>> , "writerConfig" :
> > >>>>>>>>>>>> {
> > >>>>>>>>>>>> "elasticsearch" : {
> > >>>>>>>>>>>> "batchSize" : 100
> > >>>>>>>>>>>> }
> > >>>>>>>>>>>> }
> > >>>>>>>>>>>> }
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - All writers write all messages
> > >>>>>>>>>>>> - index is named "foo", different from the sensor for
> > >> all
> > >>>>>>>> writers
> > >>>>>>>>>>>> - batchSize is 1 for HDFS and 100 for elasticsearch
> > >>> writers
> > >>>>>>>>>>>> - NOTE: I could override the index name too
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> *Writer-specific case with filters*
> > >>>>>>>>>>>> {
> > >>>>>>>>>>>> "index" : "foo"
> > >>>>>>>>>>>> ,"batchSize" : 1
> > >>>>>>>>>>>> , "writerConfig" :
> > >>>>>>>>>>>> {
> > >>>>>>>>>>>> "elasticsearch" : {
> > >>>>>>>>>>>> "batchSize" : 100,
> > >>>>>>>>>>>> "when" : "exists(field1)"
> > >>>>>>>>>>>> },
> > >>>>>>>>>>>> "hdfs" : {
> > >>>>>>>>>>>> "when" : "false"
> > >>>>>>>>>>>> }
> > >>>>>>>>>>>> }
> > >>>>>>>>>>>> }
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - ES writer writes messages which have field1, HDFS
> > >>> doesn't
> > >>>>>>>>>>>> - index is named "foo", different from the sensor for
> > >> all
> > >>>>>>>> writers
> > >>>>>>>>>>>> - batchSize is 100 for elasticsearch writers
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thoughts?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby <
> > >>>>>>>> cduby@hortonworks.com
> > >>>>>>>>>>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> For larger installations you need to control what is
> > >>>> indexed
> > >>>>>> so
> > >>>>>>>> you
> > >>>>>>>>>>> don’t
> > >>>>>>>>>>>>> end up with a nasty elastic search situation and so
> > >> you
> > >>>> can
> > >>>>>> mine
> > >>>>>>>>> the
> > >>>>>>>>>>> data
> > >>>>>>>>>>>>> later for reports and training ml models.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thanks
> > >>>>>>>>>>>>> Carolyn
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On 1/13/17, 9:40 AM, "Casey Stella" <
> > >> cestella@gmail.com
> > >>>>
> > >>>>>> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> OH that's a good idea!
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen <
> > >>>>>>>> nick@nickallen.org>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I like the "Index Filtering" option based on the
> > >>>>>> flexibility
> > >>>>>>>>> that
> > >>>>>>>>>> it
> > >>>>>>>>>>>>>>> provides. Should each output (HDFS, ES, etc) have
> > >> its
> > >>>> own
> > >>>>>>>>>>>> configuration
> > >>>>>>>>>>>>>>> settings? For example, aren't things like batching
> > >>>>> handled
> > >>>>>>>>>>> separately
> > >>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>> HDFS versus Elasticsearch?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Something along the lines of...
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> {
> > >>>>>>>>>>>>>>> "hdfs" : {
> > >>>>>>>>>>>>>>> "when": "exists(field1)",
> > >>>>>>>>>>>>>>> "batchSize": 100
> > >>>>>>>>>>>>>>> },
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> "elasticsearch" : {
> > >>>>>>>>>>>>>>> "when": "true",
> > >>>>>>>>>>>>>>> "batchSize": 1000,
> > >>>>>>>>>>>>>>> "index": "squid"
> > >>>>>>>>>>>>>>> }
> > >>>>>>>>>>>>>>> }
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella <
> > >>>>>>>>> cestella@gmail.com
> > >>>>>>>>>>>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Yeah, I tend to like the first option too. Any
> > >>>>> opposition
> > >>>>>>>> to
> > >>>>>>>>>> that
> > >>>>>>>>>>>>> from
> > >>>>>>>>>>>>>>>> anyone?
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> The points brought up are good ones and I think
> > >>> that
> > >>>> it
> > >>>>>>>> may be
> > >>>>>>>>>>>> worth a
> > >>>>>>>>>>>>>>>> broader discussion of the requirements of
> > >> indexing
> > >>>> in a
> > >>>>>>>>> separate
> > >>>>>>>>>>> dev
> > >>>>>>>>>>>>> list
> > >>>>>>>>>>>>>>>> thread. Maybe a list of desires with coherent
> > >>>> use-cases
> > >>>>>>>>>>> justifying
> > >>>>>>>>>>>>> them
> > >>>>>>>>>>>>>>> so
> > >>>>>>>>>>>>>>>> we can think about how this stuff should work and
> > >>>> where
> > >>>>>> the
> > >>>>>>>>>>> natural
> > >>>>>>>>>>>>>>>> extension points should be. After all, we need to
> > >>> toe
> > >>>>> the
> > >>>>>>>> line
> > >>>>>>>>>>>> between
> > >>>>>>>>>>>>>>>> engineering and overengineering for features
> > >> nobody
> > >>>>> will
> > >>>>>>>> want.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> I'm not sure about the extensions to the standard
> > >>>>> fields.
> > >>>>>>>> I'm
> > >>>>>>>>>>> torn
> > >>>>>>>>>>>>>>> between
> > >>>>>>>>>>>>>>>> the notions that we should have no standard
> > >> fields
> > >>> vs
> > >>>>> we
> > >>>>>>>>> should
> > >>>>>>>>>>>> have a
> > >>>>>>>>>>>>>>>> boatload of standard fields (with most of them
> > >>>> empty).
> > >>>>> I
> > >>>>>>>>>> exchange
> > >>>>>>>>>>>>>>>> positions fairly regularly on that question. ;)
> > >> It
> > >>>> may
> > >>>>> be
> > >>>>>>>>>> worth a
> > >>>>>>>>>>>> dev
> > >>>>>>>>>>>>>>> list
> > >>>>>>>>>>>>>>>> discussion to lay out how you imagine an
> > >> extension
> > >>> of
> > >>>>>>>> standard
> > >>>>>>>>>>>> fields
> > >>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>> how it might look as implemented in Metron.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Casey
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson
> > >> <
> > >>>>>>>>>>>>>>>> kylerichardson2@gmail.com>
> > >>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I'll second my preference for the first
> > >> option. I
> > >>>>> think
> > >>>>>>>> the
> > >>>>>>>>>>>> ability
> > >>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>> use
> > >>>>>>>>>>>>>>>>> Stellar filters to customize indexing would be
> > >> a
> > >>>> big
> > >>>>>> win.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I'm glad Matt brought up the point about data
> > >>> lake
> > >>>>> and
> > >>>>>>>> CEP.
> > >>>>>>>>> I
> > >>>>>>>>>>>> think
> > >>>>>>>>>>>>>>> this
> > >>>>>>>>>>>>>>>> is
> > >>>>>>>>>>>>>>>>> a really important use case that we need to
> > >>>> consider.
> > >>>>>>>> Take a
> > >>>>>>>>>>>> simple
> > >>>>>>>>>>>>>>>>> example... If I have data coming in from 3
> > >>>> different
> > >>>>>>>>> firewall
> > >>>>>>>>>>>>> vendors
> > >>>>>>>>>>>>>>>> and 2
> > >>>>>>>>>>>>>>>>> different web proxy/url filtering vendors and I
> > >>>> want
> > >>>>> to
> > >>>>>>>> be
> > >>>>>>>>>> able
> > >>>>>>>>>>> to
> > >>>>>>>>>>>>>>>> analyze
> > >>>>>>>>>>>>>>>>> that data set, I need the data to be indexed
> > >> all
> > >>>>>> together
> > >>>>>>>>>>> (likely
> > >>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>> HDFS)
> > >>>>>>>>>>>>>>>>> and to have a normalized schema such that IP
> > >>>> address,
> > >>>>>>>> URL,
> > >>>>>>>>> and
> > >>>>>>>>>>>> user
> > >>>>>>>>>>>>>>> name
> > >>>>>>>>>>>>>>>>> (to take a few) can be easily queried and
> > >>>>> aggregated. I
> > >>>>>>>> can
> > >>>>>>>>>> also
> > >>>>>>>>>>>>>>> envision
> > >>>>>>>>>>>>>>>>> scenarios where I would want to index data
> > >> based
> > >>> on
> > >>>>>>>>> attributes
> > >>>>>>>>>>>> other
> > >>>>>>>>>>>>>>> than
> > >>>>>>>>>>>>>>>>> sensor, business unit or subsidiary for
> > >> example.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I've been wanting to propose extending our 7
> > >>>> standard
> > >>>>>>>> fields
> > >>>>>>>>> to
> > >>>>>>>>>>>>> include
> > >>>>>>>>>>>>>>>>> things like URL and user. Is there community
> > >>>>>>>>> interest/support
> > >>>>>>>>>>> for
> > >>>>>>>>>>>>>>> moving
> > >>>>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>> that direction? If so, I'll start a new thread.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks!
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> -Kyle
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <
> > >>>>>>>>> mattf@apache.org
> > >>>>>>>>>>>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Ah, I see. If overriding the default index
> > >> name
> > >>>>>> allows
> > >>>>>>>>>> using
> > >>>>>>>>>>>> the
> > >>>>>>>>>>>>>>> same
> > >>>>>>>>>>>>>>>>>> name for multiple sensors, then the goal can
> > >> be
> > >>>>>>>> achieved.
> > >>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>> --Matt
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On 1/12/17, 3:30 PM, "Casey Stella" <
> > >>>>>>>> cestella@gmail.com>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Oh, you could! Let's say you have a syslog
> > >>> parser
> > >>>>>>>>> with
> > >>>>>>>>>>> data
> > >>>>>>>>>>>>> from
> > >>>>>>>>>>>>>>>>>> sources 1
> > >>>>>>>>>>>>>>>>>> 2 and 3. You'd end up with one kafka queue
> > >>> with 3
> > >>>>>>>>>> parsers
> > >>>>>>>>>>>>>>> attached
> > >>>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>> that
> > >>>>>>>>>>>>>>>>>> queue, each picking apart the messages from
> > >>> source
> > >>>>>>>> 1, 2
> > >>>>>>>>>> and
> > >>>>>>>>>>>> 3.
> > >>>>>>>>>>>>>>>> They'd
> > >>>>>>>>>>>>>>>>>> go
> > >>>>>>>>>>>>>>>>>> through separate enrichment and into the
> > >>> indexing
> > >>>>>>>>>>> topology.
> > >>>>>>>>>>>>> In
> > >>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>> indexing topology, you could specify the same
> > >>>> index
> > >>>>>>>>> name
> > >>>>>>>>>>>>> "syslog"
> > >>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>> all
> > >>>>>>>>>>>>>>>>>> of the messages go into the same index for
> > >> CEP
> > >>>>>>>>> querying
> > >>>>>>>>>> if
> > >>>>>>>>>>>> so
> > >>>>>>>>>>>>>>>>> desired.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <
> > >>>>>>>>>>>> mattf@apache.org
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Syslog is hell on parsers – I know, I
> > >> worked
> > >>> at
> > >>>>>>>>>> LogLogic
> > >>>>>>>>>>>> in
> > >>>>>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>> previous
> > >>>>>>>>>>>>>>>>>>> life. It makes perfect sense to route
> > >>> different
> > >>>>>>>>> lines
> > >>>>>>>>>>>> from
> > >>>>>>>>>>>>>>>> syslog
> > >>>>>>>>>>>>>>>>>> through
> > >>>>>>>>>>>>>>>>>>> different appropriate parsers. But a lot of
> > >>>> what
> > >>>>>>>>> the
> > >>>>>>>>>>>>> parsers
> > >>>>>>>>>>>>>>> do
> > >>>>>>>>>>>>>>>> is
> > >>>>>>>>>>>>>>>>>>> identify consistent subsets of metadata and
> > >>>>>>>> annotate
> > >>>>>>>>>> it
> > >>>>>>>>>>> –
> > >>>>>>>>>>>>> eg,
> > >>>>>>>>>>>>>>>>>> src_ip_addr,
> > >>>>>>>>>>>>>>>>>>> event timestamps, etc. Once those metadata
> > >>> are
> > >>>>>>>>>>> annotated
> > >>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>> available
> > >>>>>>>>>>>>>>>>>>> with common field names, why doesn’t it
> > >> make
> > >>>>>>>> sense
> > >>>>>>>>> to
> > >>>>>>>>>>>> index
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>> messages
> > >>>>>>>>>>>>>>>>>>> together, for CEP querying? I think Splunk
> > >>> has
> > >>>>>>>>>>>> illustrated
> > >>>>>>>>>>>>>>> this
> > >>>>>>>>>>>>>>>>>> model.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> On 1/12/17, 3:00 PM, "Casey Stella" <
> > >>>>>>>>>> cestella@gmail.com
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> yeah, I mean, honestly, I think the
> > >> approach
> > >>>>>>>>> that
> > >>>>>>>>>>>> we've
> > >>>>>>>>>>>>>>> taken
> > >>>>>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>>> sources
> > >>>>>>>>>>>>>>>>>>> which aggregate different types of data is
> > >> to
> > >>>>>>>>>>> provide
> > >>>>>>>>>>>>>>> filters
> > >>>>>>>>>>>>>>>>> at
> > >>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>> parser
> > >>>>>>>>>>>>>>>>>>> level and have multiple parser topologies
> > >>>>>>>> (with
> > >>>>>>>>>>>>> different,
> > >>>>>>>>>>>>>>>>>> possibly
> > >>>>>>>>>>>>>>>>>>> mutually exclusive filters) running. This
> > >>>>>>>> would
> > >>>>>>>>>> be
> > >>>>>>>>>>> a
> > >>>>>>>>>>>>>>>>> completely
> > >>>>>>>>>>>>>>>>>>> separate
> > >>>>>>>>>>>>>>>>>>> sensor. Imagine a syslog data source that
> > >>>>>>>>>>> aggregates
> > >>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>> you
> > >>>>>>>>>>>>>>>>>> want to
> > >>>>>>>>>>>>>>>>>>> pick
> > >>>>>>>>>>>>>>>>>>> apart certain pieces of messages. This is
> > >>>>>>>> why
> > >>>>>>>>> the
> > >>>>>>>>>>>>> initial
> > >>>>>>>>>>>>>>>>>> thought and
> > >>>>>>>>>>>>>>>>>>> architecture was one index per sensor.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 5:55 PM, Matt
> > >> Foley <
> > >>>>>>>>>>>>>>>> mattf@apache.org>
> > >>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> I’m thinking that CEP (Complex Event
> > >>>>>>>>> Processing)
> > >>>>>>>>>>> is
> > >>>>>>>>>>>>>>>> contrary
> > >>>>>>>>>>>>>>>>>> to the
> > >>>>>>>>>>>>>>>>>>> idea
> > >>>>>>>>>>>>>>>>>>>> of silo-ing data per sensor.
> > >>>>>>>>>>>>>>>>>>>> Now it’s true that some of those sensors
> > >>>>>>>> are
> > >>>>>>>>>>> already
> > >>>>>>>>>>>>>>>>>> aggregating
> > >>>>>>>>>>>>>>>>>>> data from
> > >>>>>>>>>>>>>>>>>>>> multiple sources, so maybe I’m wrong
> > >> here.
> > >>>>>>>>>>>>>>>>>>>> But it just seems to me that the “data
> > >>>>>>>> lake”
> > >>>>>>>>>>>> insights
> > >>>>>>>>>>>>>>> come
> > >>>>>>>>>>>>>>>>> from
> > >>>>>>>>>>>>>>>>>>> being able
> > >>>>>>>>>>>>>>>>>>>> to make decisions over the whole mass of
> > >>>>>>>> data
> > >>>>>>>>>>> rather
> > >>>>>>>>>>>>> than
> > >>>>>>>>>>>>>>>>> just
> > >>>>>>>>>>>>>>>>>>> vertical
> > >>>>>>>>>>>>>>>>>>>> slices of it.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> On 1/12/17, 2:15 PM, "Casey Stella" <
> > >>>>>>>>>>>>> cestella@gmail.com>
> > >>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Hey Matt,
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Thanks for the comment!
> > >>>>>>>>>>>>>>>>>>>> 1. At the moment, we only have one index name, the default
> > >>>>>>>>>>>>>>>>>>>> of which is the sensor name, but that's entirely up to the
> > >>>>>>>>>>>>>>>>>>>> user. This is sensor specific, so it'd be a separate config
> > >>>>>>>>>>>>>>>>>>>> for each sensor. If we want to build multiple indices per
> > >>>>>>>>>>>>>>>>>>>> sensor, we'd have to think carefully about how to do that,
> > >>>>>>>>>>>>>>>>>>>> and it would be a bigger undertaking. I guess I can see the
> > >>>>>>>>>>>>>>>>>>>> use, though (redirect messages to one index vs another
> > >>>>>>>>>>>>>>>>>>>> based on a predicate for a given sensor). Anyway, not where
> > >>>>>>>>>>>>>>>>>>>> I was originally thinking that this discussion would go,
> > >>>>>>>>>>>>>>>>>>>> but it's an interesting point.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> 2. I hadn't thought through the implementation quite yet,
> > >>>>>>>>>>>>>>>>>>>> but we don't actually have a splitter bolt in that
> > >>>>>>>>>>>>>>>>>>>> topology, just a spout that goes to the elasticsearch
> > >>>>>>>>>>>>>>>>>>>> writer and also to the hdfs writer.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley <mattf@apache.org> wrote:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Casey, good to have controls like this. Couple questions:
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> 1. Regarding the “index” : “squid” name/value pair, is the
> > >>>>>>>>>>>>>>>>>>>>> index name expected to always be a sensor name? Or is the
> > >>>>>>>>>>>>>>>>>>>>> given JSON structure subordinate to a sensor name in
> > >>>>>>>>>>>>>>>>>>>>> ZooKeeper? Or can we build arbitrary indexes with this new
> > >>>>>>>>>>>>>>>>>>>>> specification, independent of sensor? Should there
> > >>>>>>>>>>>>>>>>>>>>> actually be a list of “indexes”, i.e.
> > >>>>>>>>>>>>>>>>>>>>> { “indexes” : [
> > >>>>>>>>>>>>>>>>>>>>> {“index” : “name1”,
> > >>>>>>>>>>>>>>>>>>>>> …
> > >>>>>>>>>>>>>>>>>>>>> },
> > >>>>>>>>>>>>>>>>>>>>> {“index” : “name2”,
> > >>>>>>>>>>>>>>>>>>>>> …
> > >>>>>>>>>>>>>>>>>>>>> } ]
> > >>>>>>>>>>>>>>>>>>>>> }
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> 2. Would the filtering / writer selection logic take place
> > >>>>>>>>>>>>>>>>>>>>> in the indexing topology splitter bolt? Seems like that
> > >>>>>>>>>>>>>>>>>>>>> would have the smallest impact on current implementation,
> > >>>>>>>>>>>>>>>>>>>>> no?
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Sorry if these are already answered in PR-415, I haven’t
> > >>>>>>>>>>>>>>>>>>>>> had time to review that one yet.
> > >>>>>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>>>> --Matt
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> On 1/12/17, 12:55 PM, "Michael Miklavcic" <michael.miklavcic@gmail.com> wrote:
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> I like the flexibility and expressibility of the first
> > >>>>>>>>>>>>>>>>>>>>> option with Stellar filters.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> M
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <cestella@gmail.com> wrote:
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> As of METRON-652
> > >>>>>>>>>>>>>>>>>>>>>> <https://github.com/apache/incubator-metron/pull/415>, we
> > >>>>>>>>>>>>>>>>>>>>>> will have decoupled the indexing configuration from the
> > >>>>>>>>>>>>>>>>>>>>>> enrichment configuration. As an immediate follow-up to
> > >>>>>>>>>>>>>>>>>>>>>> that, I'd like to provide the ability to turn off and on
> > >>>>>>>>>>>>>>>>>>>>>> writers via the configs. I'd like to get some community
> > >>>>>>>>>>>>>>>>>>>>>> feedback on how the functionality should work, if y'all
> > >>>>>>>>>>>>>>>>>>>>>> are amenable. :)
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> As of now, we have 3 possible writers which can be used
> > >>>>>>>>>>>>>>>>>>>>>> in the indexing topology:
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> - Solr
> > >>>>>>>>>>>>>>>>>>>>>> - Elasticsearch
> > >>>>>>>>>>>>>>>>>>>>>> - HDFS
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> HDFS is always used; Elasticsearch or Solr is used
> > >>>>>>>>>>>>>>>>>>>>>> depending on how you start the indexing topology.
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> A couple of proposals come to mind immediately:
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> *Index Filtering*
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> You would be able to specify a filter as defined by a
> > >>>>>>>>>>>>>>>>>>>>>> stellar statement (likely a reuse of the StellarFilter
> > >>>>>>>>>>>>>>>>>>>>>> that exists in the Parsers) which would allow you to
> > >>>>>>>>>>>>>>>>>>>>>> indicate on a message-by-message basis whether or not to
> > >>>>>>>>>>>>>>>>>>>>>> write the message.
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> The semantics of this would be as follows:
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> - Default (i.e. unspecified) is to pass everything
> > >>>>>>>>>>>>>>>>>>>>>> through (hence backwards compatible with the current
> > >>>>>>>>>>>>>>>>>>>>>> default config).
> > >>>>>>>>>>>>>>>>>>>>>> - Messages which have the associated stellar statement
> > >>>>>>>>>>>>>>>>>>>>>> evaluate to true for the writer type will be written,
> > >>>>>>>>>>>>>>>>>>>>>> otherwise not.
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Sample indexing config which would write out no messages
> > >>>>>>>>>>>>>>>>>>>>>> to HDFS and write out to Elasticsearch only messages
> > >>>>>>>>>>>>>>>>>>>>>> containing a field called "field1":
> > >>>>>>>>>>>>>>>>>>>>>> {
> > >>>>>>>>>>>>>>>>>>>>>> "index" : "squid"
> > >>>>>>>>>>>>>>>>>>>>>> ,"batchSize" : 100
> > >>>>>>>>>>>>>>>>>>>>>> ,"filters" : {
> > >>>>>>>>>>>>>>>>>>>>>> "HDFS" : "false"
> > >>>>>>>>>>>>>>>>>>>>>> ,"ES" : "exists(field1)"
> > >>>>>>>>>>>>>>>>>>>>>> }
> > >>>>>>>>>>>>>>>>>>>>>> }
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> *Index On/Off Switch*
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> A simpler solution would be to just provide a list of
> > >>>>>>>>>>>>>>>>>>>>>> writers to write messages. The semantics would be as
> > >>>>>>>>>>>>>>>>>>>>>> follows:
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> - If the list is unspecified, then the default is to
> > >>>>>>>>>>>>>>>>>>>>>> write all messages for every writer in the indexing
> > >>>>>>>>>>>>>>>>>>>>>> topology
> > >>>>>>>>>>>>>>>>>>>>>> - If the list is specified, then a writer will write all
> > >>>>>>>>>>>>>>>>>>>>>> messages if and only if it is named in the list.
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Sample indexing config which turns off HDFS and keeps on
> > >>>>>>>>>>>>>>>>>>>>>> Elasticsearch:
> > >>>>>>>>>>>>>>>>>>>>>> {
> > >>>>>>>>>>>>>>>>>>>>>> "index" : "squid"
> > >>>>>>>>>>>>>>>>>>>>>> ,"batchSize" : 100
> > >>>>>>>>>>>>>>>>>>>>>> ,"writers" : [ "ES" ]
> > >>>
> > >>> --
> > >>
> > >> Jon
> > >>
> > >> Sent from my mobile device
> > >>
> >
> >
>
>
> --
> Nick Allen <nick@nickallen.org>
>
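
To make the semantics of the two proposals quoted above concrete, here is a
minimal sketch. It is written in Python for brevity and is not taken from the
Metron code base. The function names, the SAMPLE_FILTERS table, and the use of
plain Python callables in place of Stellar statements are all assumptions made
for illustration. In particular, exists(field1) is approximated as "the key is
present and not None", and a writer with no entry in a filters map is assumed
to pass everything through.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, object]
Predicate = Callable[[Message], bool]

# Hypothetical stand-ins for the Stellar statements in the sample config:
#   "HDFS" : "false"  and  "ES" : "exists(field1)"
SAMPLE_FILTERS: Dict[str, Predicate] = {
    "HDFS": lambda message: False,
    "ES": lambda message: message.get("field1") is not None,
}


def passes_filter(
    writer: str,
    message: Message,
    filters: Optional[Dict[str, Predicate]] = None,
) -> bool:
    """Index Filtering proposal: decide per message, per writer."""
    # Unspecified filters pass everything through (backwards compatible).
    # Assumption: a writer missing from the map is treated the same way.
    if filters is None or writer not in filters:
        return True
    # Otherwise the message is written only if the predicate is true.
    return bool(filters[writer](message))


def writer_enabled(writer: str, writers: Optional[List[str]] = None) -> bool:
    """Index On/Off Switch proposal: decide per writer only."""
    # An unspecified list means every writer writes all messages.
    if writers is None:
        return True
    # A specified list enables a writer if and only if it is named.
    return writer in writers
```

With SAMPLE_FILTERS, nothing is written to HDFS and only messages carrying
field1 reach ES. With the list form, writer_enabled("HDFS", ["ES"]) is false
while a missing list keeps every writer on. Note that under this reading an
explicitly empty list turns every writer off.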
