metron-dev mailing list archives

From James Sirota <jsir...@apache.org>
Subject Re: [DISCUSS] Turning off indexing writers feature discussion
Date Mon, 16 Jan 2017 21:01:53 GMT
The explicit on/off seems like a good option to have.  This way I don't have to completely remove the config block in order for me to test something.  I think if the config for the writer is unspecified we should throw up a warning.  
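
To make the semantics concrete, here is a minimal Python sketch of the on/off check as I read the proposal: an explicit 'enabled' flag, default on when a writer is unspecified, plus a warning in that case. The function name `is_writer_enabled` and the dict-based config are assumptions for illustration only, not Metron code.

```python
import warnings


def is_writer_enabled(indexing_config, writer):
    """Return True unless the writer's config block explicitly disables it."""
    writer_config = indexing_config.get(writer)
    if writer_config is None:
        # Unspecified writer: keep the current on-by-default semantics, but warn.
        warnings.warn(f"No indexing config for writer '{writer}'; defaulting to enabled")
        return True
    enabled = writer_config.get("enabled", True)
    # The examples in this thread quote the flag as a string ('true'/'false'),
    # so accept both strings and booleans.
    if isinstance(enabled, str):
        return enabled.strip().lower() == "true"
    return bool(enabled)


# Config mirroring Jon's example: HDFS is switched off but keeps its settings.
config = {
    "elasticsearch": {"enabled": "true", "index": "foo", "batchSize": 100},
    "hdfs": {"enabled": "false", "path": "/foo/bar/...", "batchSize": 100},
}
```

With this, flipping 'enabled' back to 'true' for hdfs re-enables the writer without having to restore the path or batchSize from somewhere else.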

16.01.2017, 08:54, "Nick Allen" <nick@nickallen.org>:
>>  To recap, what I am +1 on is Nick's proposed syntax with the following
>>  modifications:
>>  1. An explicit enabled field
>>  2. A default on for unspecified to match current semantics
>
> I'm +1 on all of this.
>
> On Sat, Jan 14, 2017 at 10:51 AM, Casey Stella <cestella@gmail.com> wrote:
>
>>  I'm +1 on an explicit enabled property and a filter (or when) property. I
>>  think we are zeroing in on a decent design, so that is good.
>>
>>  To recap, what I am +1 on is Nick's proposed syntax with the following
>>  modifications:
>>  1. An explicit enabled field
>>  2. A default on for unspecified to match current semantics
>>
>>  Casey
>>  On Sat, Jan 14, 2017 at 10:45 Zeolla@GMail.com <zeolla@gmail.com> wrote:
>>
>>  > This has the additional benefit of doing something like below when you
>>  want
>>  > to temporarily disable the hdfs writer, but don't want to remove the
>>  > settings. This removes the need to store the path and batchSize (and
>>  many
>>  > additional settings) somewhere else so they can be brought back in when
>>  you
>>  > want to re-enable it, which is a nice workflow attribute for the end
>>  user:
>>  >
>>  > {
>>  > 'elasticsearch': {
>>  > 'enabled': 'true',
>>  > 'index': 'foo',
>>  > 'batchSize': 100
>>  > },
>>  > 'hdfs': {
>>  > 'enabled': 'false',
>>  > 'path': '/foo/bar/...',
>>  > 'batchSize': 100
>>  > },
>>  > 'solr': {
>>  > 'enabled': 'false'
>>  > }
>>  > }
>>  >
>>  > Jon
>>  >
>>  > On Sat, Jan 14, 2017 at 9:24 AM Zeolla@GMail.com <zeolla@gmail.com>
>>  wrote:
>>  >
>>  > > I similarly have a concern there because I prefer being as explicit as
>>  > > possible, which makes things easier to pick up for new users. Using my
>>  > > example from earlier this could look like specifying when(false), but
>>  an
>>  > > even better and more obvious approach may be to use enabled(false). So
>>  > the
>>  > > current simple default would be:
>>  > >
>>  > > {
>>  > > 'elasticsearch': { 'enabled': 'true' },
>>  > > 'hdfs': { 'enabled': 'true' },
>>  > > 'solr': { 'enabled': 'false' }
>>  > > }
>>  > >
>>  > > And to use ES with some overrides but not HDFS or solr it would look
>>  > like:
>>  > >
>>  > > {
>>  > > 'elasticsearch': {
>>  > > 'enabled': 'true',
>>  > > 'index': 'foo',
>>  > > 'batchSize': 100
>>  > > },
>>  > > 'hdfs': {
>>  > > 'enabled': 'false'
>>  > > },
>>  > > 'solr': {
>>  > > 'enabled': 'false'
>>  > > }
>>  > > }
>>  > >
>>  > > Jon
>>  > >
>>  > > On Fri, Jan 13, 2017 at 10:21 PM Casey Stella <cestella@gmail.com>
>>  > wrote:
>>  > >
>>  > > One thing that I thought of that I very strenuously do not like in Nick's
>>  > > proposal is that if a writer config is not specified then it is turned
>>  > off
>>  > > (I think; if I misunderstood let me know). In the situation where we
>>  > have a
>>  > > new sensor, right now if there are no index config and no enrichment
>>  > > config, it still passes through to the index using defaults. In this
>>  new
>>  > > scheme it would not. This changes the default semantics for the system
>>  > and
>>  > > I think it changes it for the worse.
>>  > >
>>  > > I would strongly prefer an on-by-default indexing config as we have now.
>>  > > On Fri, Jan 13, 2017 at 17:13 Casey Stella <cestella@gmail.com> wrote:
>>  > >
>>  > > > One thing that I really like about Nick's suggestion is that it
>>  allows
>>  > > > writer-specific configs in a clear and simple way. It is more
>>  complex
>>  > > for
>>  > > > the default case (all writers write to indices named the same thing
>>  > with
>>  > > a
>>  > > > fixed batch size), which I do not like, but maybe it's worth the
>>  > > compromise
>>  > > > to make it less complex for the advanced case.
>>  > > >
>>  > > > Thanks a lot for the suggestion, Nick, it's interesting; I'm
>>  beginning
>>  > > to
>>  > > > lean your way.
>>  > > >
>>  > > > On Fri, Jan 13, 2017 at 2:51 PM, Zeolla@GMail.com <zeolla@gmail.com>
>>  > > > wrote:
>>  > > >
>>  > > > I like the suggestions you made, Nick. The only thing I would add is
>>  > > that
>>  > > > it's also nice to see an explicit when(false), as people newer to the
>>  > > > platform may not know where to expect configs for the different
>>  > writers.
>>  > > > Being able to do it either way, which I think is already assumed in
>>  > your
>>  > > > model, would make sense. I would just suggest that, if we support
>>  but
>>  > > are
>>  > > > disabling a writer, that the platform inserts a default when(false)
>>  to
>>  > be
>>  > > > explicit.
>>  > > >
>>  > > > Jon
>>  > > >
>>  > > > On Fri, Jan 13, 2017 at 11:59 AM Casey Stella <cestella@gmail.com>
>>  > > wrote:
>>  > > >
>>  > > > > Let me noodle on this over the weekend. Your syntax is looking
>>  less
>>  > > > > onerous to me and I like the following statement from Otto: "In the
>>  > > end,
>>  > > > > each write destination ‘type’ will need its own configuration.
>>  This
>>  > > is
>>  > > > an
>>  > > > > extension point."
>>  > > > >
>>  > > > > I may come around to your way of thinking.
>>  > > > >
>>  > > > > On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler <
>>  > ottobackwards@gmail.com
>>  > > >
>>  > > > > wrote:
>>  > > > >
>>  > > > > > In the end, each write destination ‘type’ will need its own
>>  > > > > > configuration. This is an extension point.
>>  > > > > > {
>>  > > > > > HDFS:{
>>  > > > > > outputAdapters:[
>>  > > > > > {name: avro,
>>  > > > > > settings:{
>>  > > > > > avro stuff….
>>  > > > > > when:{
>>  > > > > > },
>>  > > > > > {
>>  > > > > > name: sequence file,
>>  > > > > > …..
>>  > > > > >
>>  > > > > > or some such.
>>  > > > > >
>>  > > > > >
>>  > > > > > On January 13, 2017 at 11:51:15, Nick Allen (nick@nickallen.org)
>>  > > > wrote:
>>  > > > > >
>>  > > > > > I will add also that instead of global overrides, like index, we
>>  > > should
>>  > > > > use
>>  > > > > > configuration key names that are more appropriate to the output.
>>  > > > > >
>>  > > > > > For example, does 'index' really make sense for HDFS? Or would
>>  > 'path'
>>  > > > be
>>  > > > > > more appropriate?
>>  > > > > >
>>  > > > > > {
>>  > > > > > 'elasticsearch': {
>>  > > > > > 'index': 'foo',
>>  > > > > > 'batchSize': 1
>>  > > > > > },
>>  > > > > > 'hdfs': {
>>  > > > > > 'path': '/foo/bar/...',
>>  > > > > > 'batchSize': 100
>>  > > > > > }
>>  > > > > > }
>>  > > > > >
>>  > > > > > Ok, I've said my piece. Thanks for the effort in summarizing all
>>  > > this,
>>  > > > > > Casey.
>>  > > > > >
>>  > > > > >
>>  > > > > > On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen <nick@nickallen.org
>>  >
>>  > > > wrote:
>>  > > > > >
>>  > > > > > > Nick's concerns about my suggestion were that it was overly
>>  > complex
>>  > > > and
>>  > > > > > >> hard to grok and that we could dispense with backwards
>>  > > compatibility
>>  > > > > and
>>  > > > > > >> make people do a bit more work on the default case for the
>>  > > benefits
>>  > > > > of a
>>  > > > > > >> simpler advanced case. (Nick, make sure I don't misstate your
>>  > > > > position)
>>  > > > > > >
>>  > > > > > >
>>  > > > > > > I will add that in my mind, the majority case would be a
>>  user
>>  > > > > > > specifying the outputs, but not things like 'batchSize' or
>>  > 'when'.
>>  > > I
>>  > > > > > think
>>  > > > > > > in the majority case, the user would accept whatever the
>>  default
>>  > > > batch
>>  > > > > > size
>>  > > > > > > is.
>>  > > > > > >
>>  > > > > > > Here are alternative suggestions for all the examples that you
>>  > > > > provided
>>  > > > > > > previously.
>>  > > > > > >
>>  > > > > > > Base Case
>>  > > > > > >
>>  > > > > > > - The user must always specify the 'outputs' for clarity.
>>  > > > > > > - Uses default index name, batch size and when = true.
>>  > > > > > >
>>  > > > > > > {
>>  > > > > > > 'elasticsearch': {},
>>  > > > > > > 'hdfs': {}
>>  > > > > > > }
>>  > > > > > >
>>  > > > > > >
>>  > > > > > > Writer-non-specific Case <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-non-specific-case>
>>  > > > > > >
>>  > > > > > > - There are no global overrides, as in Casey's proposal.
>>  > > > > > > - Easier to grok IMO.
>>  > > > > > >
>>  > > > > > > {
>>  > > > > > > 'elasticsearch': {
>>  > > > > > > 'index': 'foo',
>>  > > > > > > 'batchSize': 100
>>  > > > > > > },
>>  > > > > > > 'hdfs': {
>>  > > > > > > 'index': 'foo',
>>  > > > > > > 'batchSize': 100
>>  > > > > > > }
>>  > > > > > > }
>>  > > > > > >
>>  > > > > > >
>>  > > > > > > Writer-specific case without filters <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-without-filters>
>>  > > > > > >
>>  > > > > > > {
>>  > > > > > > 'elasticsearch': {
>>  > > > > > > 'index': 'foo',
>>  > > > > > > 'batchSize': 1
>>  > > > > > > },
>>  > > > > > > 'hdfs': {
>>  > > > > > > 'index': 'foo',
>>  > > > > > > 'batchSize': 100
>>  > > > > > > }
>>  > > > > > > }
>>  > > > > > >
>>  > > > > > >
>>  > > > > > > Writer-specific case with filters <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-with-filters>
>>  > > > > > >
>>  > > > > > > - Instead of having to say when=false, just don't configure
>>  HDFS
>>  > > > > > >
>>  > > > > > > {
>>  > > > > > > 'elasticsearch': {
>>  > > > > > > 'index': 'foo',
>>  > > > > > > 'batchSize': 100,
>>  > > > > > > 'when': 'exists(field1)'
>>  > > > > > > }
>>  > > > > > > }
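
For readers following the 'when' filter idea, a rough Python sketch of how a writer-only config like the one above could decide which writers receive a message. This is an illustration under assumptions: the `PREDICATES` table is a hypothetical stand-in for Stellar evaluation, and `writers_for` is an invented name. In this scheme a writer with no block is simply never consulted, which is the off-when-omitted behavior described above.

```python
# Hypothetical stand-ins for Stellar; a real system would hand the 'when'
# string to the Stellar engine rather than look it up in a table.
PREDICATES = {
    "true": lambda message: True,
    "false": lambda message: False,
    "exists(field1)": lambda message: "field1" in message,
}


def writers_for(message, indexing_config):
    """Return the configured writers whose 'when' filter accepts the message."""
    selected = []
    for writer, writer_config in indexing_config.items():
        when = writer_config.get("when", "true")  # default: write everything
        if PREDICATES[when](message):
            selected.append(writer)
    return selected


# Elasticsearch filters on field1; HDFS is configured with all defaults.
config = {
    "elasticsearch": {"index": "foo", "batchSize": 100, "when": "exists(field1)"},
    "hdfs": {},
}
```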
>>  > > > > > >
>>  > > > > > >
>>  > > > > > >
>>  > > > > > >
>>  > > > > > >
>>  > > > > > > On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella <
>>  > cestella@gmail.com
>>  > > >
>>  > > > > > wrote:
>>  > > > > > >
>>  > > > > > >> Dave,
>>  > > > > > >> For the benefit of posterity and people who might not be as
>>  > deeply
>>  > > > > > >> entangled in the system as we have been, I'll recap things and
>>  > > > > hopefully
>>  > > > > > >> answer your question in the process.
>>  > > > > > >>
>>  > > > > > >> Historically the index configuration is split between the
>>  > > enrichment
>>  > > > > > >> configs and the global configs.
>>  > > > > > >>
>>  > > > > > >> - The global configs really control configs that apply to all
>>  > > > > sensors.
>>  > > > > > >> Historically this has been stuff like index connection
>>  strings,
>>  > > etc.
>>  > > > > > >> - The sensor-specific configs control things that vary
>>  by
>>  > > > > sensor.
>>  > > > > > >>
>>  > > > > > >> As of Metron-652 (in review currently), we moved the sensor
>>  > > specific
>>  > > > > > >> configs from the enrichment configs. The proposal here is to
>>  > > > increase
>>  > > > > > the
>>  > > > > > >> granularity of the sensor-specific files to make them
>>  > support
>>  > > > > index
>>  > > > > > >> writer-specific configs. Right now in the indexing topology,
>>  we
>>  > > > have 2
>>  > > > > > >> writers (fixed): ES/Solr and HDFS.
>>  > > > > > >>
>>  > > > > > >> The proposed configuration would allow you to either specify a
>>  > > > blanket
>>  > > > > > >> sensor-level config for the index name and batchSize and/or
>>  > > override
>>  > > > > at
>>  > > > > > >> the
>>  > > > > > >> writer level, thereby supporting a couple of use-cases:
>>  > > > > > >>
>>  > > > > > >> - Turning off certain index writers (e.g. HDFS)
>>  > > > > > >> - Filtering the messages written to certain index writers
>>  > > > > > >>
>>  > > > > > >> The two competing configs between Nick and me are as follows:
>>  > > > > > >>
>>  > > > > > >> - I want to make sure we keep the old sensor-specific defaults
>>  > > with
>>  > > > > > >> writer-specific overrides available
>>  > > > > > >> - Nick thought we could simplify the permutations by making
>>  the
>>  > > > > > >> indexing
>>  > > > > > >> config only the writer-level configs.
>>  > > > > > >>
>>  > > > > > >> My concerns about Nick's suggestion were that the default and
>>  > > > majority
>>  > > > > > >> case, specifying the index and the batchSize for all writers
>>  (the
>>  > > > one
>>  > > > > we
>>  > > > > > >> support now) would require more configuration.
>>  > > > > > >>
>>  > > > > > >> Nick's concerns about my suggestion were that it was overly
>>  > > complex
>>  > > > > and
>>  > > > > > >> hard to grok and that we could dispense with backwards
>>  > > compatibility
>>  > > > > and
>>  > > > > > >> make people do a bit more work on the default case for the
>>  > > benefits
>>  > > > > of a
>>  > > > > > >> simpler advanced case. (Nick, make sure I don't misstate your
>>  > > > > position).
>>  > > > > > >>
>>  > > > > > >> Casey
>>  > > > > > >>
>>  > > > > > >>
>>  > > > > > >> On Fri, Jan 13, 2017 at 10:54 AM, David Lyle <
>>  > > dlyle65535@gmail.com>
>>  > > > > > >> wrote:
>>  > > > > > >>
>>  > > > > > >> > Casey,
>>  > > > > > >> >
>>  > > > > > >> > Can you give me a level set of what your thinking is now? I
>>  > > think
>>  > > > > it's
>>  > > > > > >> > global control of all index types + overrides on a per-type
>>  > > basis.
>>  > > > > > Fwiw,
>>  > > > > > >> > I'm totally for that, but I want to make sure I'm not
>>  imposing
>>  > > my
>>  > > > > > >> > preconceived notions on your consensus-driven ones.
>>  > > > > > >> >
>>  > > > > > >> > -D....
>>  > > > > > >> >
>>  > > > > > >> > On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella <
>>  > > > cestella@gmail.com>
>>  > > > > > >> wrote:
>>  > > > > > >> >
>>  > > > > > >> > > I am suggesting that, yes. The configs are essentially the
>>  > > same
>>  > > > as
>>  > > > > > >> > yours,
>>  > > > > > >> > > except there is an override specified at the top level.
>>  > > Without
>>  > > > > > >> that, in
>>  > > > > > >> > > order to specify both HDFS and ES have batch sizes of 100,
>>  > you
>>  > > > > have
>>  > > > > > to
>>  > > > > > >> > > explicitly configure each. It's less that I'm trying to
>>  have
>>  > > > > > >> backwards
>>  > > > > > >> > > compatibility and more that I'm trying to make the
>>  majority
>>  > > case
>>  > > > > > easy:
>>  > > > > > >> > both
>>  > > > > > >> > > writers write everything to a specified index name with a
>>  > > > > specified
>>  > > > > > >> batch
>>  > > > > > >> > > size (which is what we have now). Beyond that, I want to
>>  > allow
>>  > > > for
>>  > > > > > >> > > specifying an override for the config on a
>>  writer-by-writer
>>  > > > basis
>>  > > > > > for
>>  > > > > > >> > those
>>  > > > > > >> > > who need it.
>>  > > > > > >> > >
>>  > > > > > >> > > On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen <
>>  > > > nick@nickallen.org>
>>  > > > > > >> wrote:
>>  > > > > > >> > >
>>  > > > > > >> > > > Are you saying we support all of these variants? I
>>  realize
>>  > > you
>>  > > > > are
>>  > > > > > >> > > trying
>>  > > > > > >> > > > to have some backwards compatibility, but this also
>>  makes
>>  > it
>>  > > > > > harder
>>  > > > > > >> > for a
>>  > > > > > >> > > > user to grok (for me at least).
>>  > > > > > >> > > >
>>  > > > > > >> > > > Personally I like my original example as there are fewer
>>  > > > > > >> > sub-structures,
>>  > > > > > >> > > > like 'writerConfig', which makes the whole thing simpler
>>  > and
>>  > > > > > easier
>>  > > > > > >> to
>>  > > > > > >> > > > grok. But maybe others will think your proposal is just
>>  as
>>  > > > easy
>>  > > > > to
>>  > > > > > >> > grok.
>>  > > > > > >> > > >
>>  > > > > > >> > > >
>>  > > > > > >> > > >
>>  > > > > > >> > > > On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella <
>>  > > > > > cestella@gmail.com>
>>  > > > > >
>>  > > > > > >> > > wrote:
>>  > > > > > >> > > >
>>  > > > > > >> > > > > Ok, so here's what I'm thinking based on the
>>  discussion:
>>  > > > > > >> > > > >
>>  > > > > > >> > > > > - Keeping the configs that we have now (batchSize and
>>  > > index)
>>  > > > > as
>>  > > > > > >> > > > defaults
>>  > > > > > >> > > > > for the unspecified writer-specific case
>>  > > > > > >> > > > > - Adding the config Nick suggested
>>  > > > > > >> > > > >
>>  > > > > > >> > > > > *Base Case*:
>>  > > > > > >> > > > > {
>>  > > > > > >> > > > > }
>>  > > > > > >> > > > >
>>  > > > > > >> > > > > - all writers write all messages
>>  > > > > > >> > > > > - index named the same as the sensor for all writers
>>  > > > > > >> > > > > - batchSize of 1 for all writers
>>  > > > > > >> > > > >
>>  > > > > > >> > > > > *Writer-non-specific case*:
>>  > > > > > >> > > > > {
>>  > > > > > >> > > > > "index" : "foo"
>>  > > > > > >> > > > > ,"batchSize" : 100
>>  > > > > > >> > > > > }
>>  > > > > > >> > > > >
>>  > > > > > >> > > > > - All writers write all messages
>>  > > > > > >> > > > > - index is named "foo", different from the sensor for
>>  > all
>>  > > > > > >> writers
>>  > > > > > >> > > > > - batchSize is 100 for all writers
>>  > > > > > >> > > > >
>>  > > > > > >> > > > > *Writer-specific case without filters*
>>  > > > > > >> > > > > {
>>  > > > > > >> > > > > "index" : "foo"
>>  > > > > > >> > > > > ,"batchSize" : 1
>>  > > > > > >> > > > > , "writerConfig" :
>>  > > > > > >> > > > > {
>>  > > > > > >> > > > > "elasticsearch" : {
>>  > > > > > >> > > > > "batchSize" : 100
>>  > > > > > >> > > > > }
>>  > > > > > >> > > > > }
>>  > > > > > >> > > > > }
>>  > > > > > >> > > > >
>>  > > > > > >> > > > > - All writers write all messages
>>  > > > > > >> > > > > - index is named "foo", different from the sensor for
>>  > all
>>  > > > > > >> writers
>>  > > > > > >> > > > > - batchSize is 1 for HDFS and 100 for elasticsearch
>>  > > writers
>>  > > > > > >> > > > > - NOTE: I could override the index name too
>>  > > > > > >> > > > >
>>  > > > > > >> > > > > *Writer-specific case with filters*
>>  > > > > > >> > > > > {
>>  > > > > > >> > > > > "index" : "foo"
>>  > > > > > >> > > > > ,"batchSize" : 1
>>  > > > > > >> > > > > , "writerConfig" :
>>  > > > > > >> > > > > {
>>  > > > > > >> > > > > "elasticsearch" : {
>>  > > > > > >> > > > > "batchSize" : 100,
>>  > > > > > >> > > > > "when" : "exists(field1)"
>>  > > > > > >> > > > > },
>>  > > > > > >> > > > > "hdfs" : {
>>  > > > > > >> > > > > "when" : "false"
>>  > > > > > >> > > > > }
>>  > > > > > >> > > > > }
>>  > > > > > >> > > > > }
>>  > > > > > >> > > > >
>>  > > > > > >> > > > > - ES writer writes messages which have field1, HDFS
>>  > > doesn't
>>  > > > > > >> > > > > - index is named "foo", different from the sensor for
>>  > all
>>  > > > > > >> writers
>>  > > > > > >> > > > > - batchSize is 100 for elasticsearch writers
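
The layering in this proposal (built-in defaults, then sensor-level values, then per-writer overrides) can be sketched in a few lines of Python. This is only a sketch under assumptions: `resolve_writer_config` is a made-up name, "squid" is a placeholder sensor name, and the default 'when' of "true" is inferred from the base case where all writers write all messages.

```python
def resolve_writer_config(sensor, indexing_config, writer):
    """Merge built-in defaults, sensor-level values, and writer-level overrides."""
    # Base case from the proposal: index named after the sensor, batchSize 1,
    # and every message written.
    resolved = {"index": sensor, "batchSize": 1, "when": "true"}
    # Sensor-level values apply to every writer.
    for key in ("index", "batchSize"):
        if key in indexing_config:
            resolved[key] = indexing_config[key]
    # Writer-level values win over sensor-level ones.
    resolved.update(indexing_config.get("writerConfig", {}).get(writer, {}))
    return resolved


# The "writer-specific case with filters" config from the proposal.
config = {
    "index": "foo",
    "batchSize": 1,
    "writerConfig": {
        "elasticsearch": {"batchSize": 100, "when": "exists(field1)"},
        "hdfs": {"when": "false"},
    },
}
```

Under this reading, elasticsearch resolves to index "foo" with batchSize 100 and the field1 filter, hdfs keeps index "foo" and batchSize 1 but is filtered off, and an empty config falls back to the base case.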
>>  > > > > > >> > > > >
>>  > > > > > >> > > > > Thoughts?
>>  > > > > > >> > > > >
>>  > > > > > >> > > > > On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby <
>>  > > > > > >> cduby@hortonworks.com
>>  > > > > > >> > >
>>  > > > > > >> > > > > wrote:
>>  > > > > > >> > > > >
>>  > > > > > >> > > > > > For larger installations you need to control what is
>>  > > > indexed
>>  > > > > > so
>>  > > > > > >> you
>>  > > > > > >> > > > don’t
>>  > > > > > >> > > > > > end up with a nasty Elasticsearch situation and so
>>  > you
>>  > > > can
>>  > > > > > mine
>>  > > > > > >> > the
>>  > > > > > >> > > > data
>>  > > > > > >> > > > > > later for reports and training ml models.
>>  > > > > > >> > > > > >
>>  > > > > > >> > > > > > Thanks
>>  > > > > > >> > > > > > Carolyn
>>  > > > > > >> > > > > >
>>  > > > > > >> > > > > >
>>  > > > > > >> > > > > >
>>  > > > > > >> > > > > >
>>  > > > > > >> > > > > > On 1/13/17, 9:40 AM, "Casey Stella" <
>>  > cestella@gmail.com
>>  > > >
>>  > > > > > wrote:
>>  > > > > > >> > > > > >
>>  > > > > > >> > > > > > >OH that's a good idea!
>>  > > > > > >> > > > > > >
>>  > > > > > >> > > > > > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen <
>>  > > > > > >> nick@nickallen.org>
>>  > > > > > >> > > > wrote:
>>  > > > > > >> > > > > > >
>>  > > > > > >> > > > > > >> I like the "Index Filtering" option based on the
>>  > > > > > flexibility
>>  > > > > > >> > that
>>  > > > > > >> > > it
>>  > > > > > >> > > > > > >> provides. Should each output (HDFS, ES, etc) have
>>  > its
>>  > > > own
>>  > > > > > >> > > > > configuration
>>  > > > > > >> > > > > > >> settings? For example, aren't things like
>>  batching
>>  > > > > handled
>>  > > > > > >> > > > separately
>>  > > > > > >> > > > > > for
>>  > > > > > >> > > > > > >> HDFS versus Elasticsearch?
>>  > > > > > >> > > > > > >>
>>  > > > > > >> > > > > > >> Something along the lines of...
>>  > > > > > >> > > > > > >>
>>  > > > > > >> > > > > > >> {
>>  > > > > > >> > > > > > >> "hdfs" : {
>>  > > > > > >> > > > > > >> "when": "exists(field1)",
>>  > > > > > >> > > > > > >> "batchSize": 100
>>  > > > > > >> > > > > > >> },
>>  > > > > > >> > > > > > >>
>>  > > > > > >> > > > > > >> "elasticsearch" : {
>>  > > > > > >> > > > > > >> "when": "true",
>>  > > > > > >> > > > > > >> "batchSize": 1000,
>>  > > > > > >> > > > > > >> "index": "squid"
>>  > > > > > >> > > > > > >> }
>>  > > > > > >> > > > > > >> }
>>  > > > > > >> > > > > > >>
>>  > > > > > >> > > > > > >>
>>  > > > > > >> > > > > > >>
>>  > > > > > >> > > > > > >>
>>  > > > > > >> > > > > > >>
>>  > > > > > >> > > > > > >>
>>  > > > > > >> > > > > > >>
>>  > > > > > >> > > > > > >>
>>  > > > > > >> > > > > > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella <
>>  > > > > > >> > cestella@gmail.com
>>  > > > > > >> > > >
>>  > > > > > >> > > > > > wrote:
>>  > > > > > >> > > > > > >>
>>  > > > > > >> > > > > > >> > Yeah, I tend to like the first option too. Any
>>  > > > > opposition
>>  > > > > > >> to
>>  > > > > > >> > > that
>>  > > > > > >> > > > > > from
>>  > > > > > >> > > > > > >> > anyone?
>>  > > > > > >> > > > > > >> >
>>  > > > > > >> > > > > > >> > The points brought up are good ones and I think
>>  > > that
>>  > > > it
>>  > > > > > >> may be
>>  > > > > > >> > > > > worth a
>>  > > > > > >> > > > > > >> > broader discussion of the requirements of
>>  > indexing
>>  > > > in a
>>  > > > > > >> > separate
>>  > > > > > >> > > > dev
>>  > > > > > >> > > > > > list
>>  > > > > > >> > > > > > >> > thread. Maybe a list of desires with coherent
>>  > > > use-cases
>>  > > > > > >> > > > justifying
>>  > > > > > >> > > > > > them
>>  > > > > > >> > > > > > >> so
>>  > > > > > >> > > > > > >> > we can think about how this stuff should work
>>  and
>>  > > > where
>>  > > > > > the
>>  > > > > > >> > > > natural
>>  > > > > > >> > > > > > >> > extension points should be. After all, we need
>>  to
>>  > > toe
>>  > > > > the
>>  > > > > > >> line
>>  > > > > > >> > > > > between
>>  > > > > > >> > > > > > >> > engineering and overengineering for features
>>  > nobody
>>  > > > > will
>>  > > > > > >> want.
>>  > > > > > >> > > > > > >> >
>>  > > > > > >> > > > > > >> > I'm not sure about the extensions to the
>>  standard
>>  > > > > fields.
>>  > > > > > >> I'm
>>  > > > > > >> > > > torn
>>  > > > > > >> > > > > > >> between
>>  > > > > > >> > > > > > >> > the notions that we should have no standard
>>  > fields
>>  > > vs
>>  > > > > we
>>  > > > > > >> > should
>>  > > > > > >> > > > > have a
>>  > > > > > >> > > > > > >> > boatload of standard fields (with most of them
>>  > > > empty).
>>  > > > > I
>>  > > > > > >> > > exchange
>>  > > > > > >> > > > > > >> > positions fairly regularly on that question. ;)
>>  > It
>>  > > > may
>>  > > > > be
>>  > > > > > >> > > worth a
>>  > > > > > >> > > > > dev
>>  > > > > > >> > > > > > >> list
>>  > > > > > >> > > > > > >> > discussion to lay out how you imagine an
>>  > extension
>>  > > of
>>  > > > > > >> standard
>>  > > > > > >> > > > > fields
>>  > > > > > >> > > > > > and
>>  > > > > > >> > > > > > >> > how it might look as implemented in Metron.
>>  > > > > > >> > > > > > >> >
>>  > > > > > >> > > > > > >> > Casey
>>  > > > > > >> > > > > > >> >
>>  > > > > > >> > > > > > >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle
>>  Richardson
>>  > <
>>  > > > > > >> > > > > > >> > kylerichardson2@gmail.com>
>>  > > > > > >> > > > > > >> > wrote:
>>  > > > > > >> > > > > > >> >
>>  > > > > > >> > > > > > >> > > I'll second my preference for the first
>>  > option. I
>>  > > > > think
>>  > > > > > >> the
>>  > > > > > >> > > > > ability
>>  > > > > > >> > > > > > to
>>  > > > > > >> > > > > > >> > use
>>  > > > > > >> > > > > > >> > > Stellar filters to customize indexing would
>>  be
>>  > a
>>  > > > big
>>  > > > > > win.
>>  > > > > > >> > > > > > >> > >
>>  > > > > > >> > > > > > >> > > I'm glad Matt brought up the point about data
>>  > > lake
>>  > > > > and
>>  > > > > > >> CEP.
>>  > > > > > >> > I
>>  > > > > > >> > > > > think
>>  > > > > > >> > > > > > >> this
>>  > > > > > >> > > > > > >> > is
>>  > > > > > >> > > > > > >> > > a really important use case that we need to
>>  > > > consider.
>>  > > > > > >> Take a
>>  > > > > > >> > > > > simple
>>  > > > > > >> > > > > > >> > > example... If I have data coming in from 3
>>  > > > different
>>  > > > > > >> > firewall
>>  > > > > > >> > > > > > vendors
>>  > > > > > >> > > > > > >> > and 2
>>  > > > > > >> > > > > > >> > > different web proxy/url filtering vendors
>>  and I
>>  > > > want
>>  > > > > to
>>  > > > > > >> be
>>  > > > > > >> > > able
>>  > > > > > >> > > > to
>>  > > > > > >> > > > > > >> > analyze
>>  > > > > > >> > > > > > >> > > that data set, I need the data to be indexed
>>  > all
>>  > > > > > together
>>  > > > > > >> > > > (likely
>>  > > > > > >> > > > > in
>>  > > > > > >> > > > > > >> > HDFS)
>>  > > > > > >> > > > > > >> > > and to have a normalized schema such that IP
>>  > > > address,
>>  > > > > > >> URL,
>>  > > > > > >> > and
>>  > > > > > >> > > > > user
>>  > > > > > >> > > > > > >> name
>>  > > > > > >> > > > > > >> > > (to take a few) can be easily queried and
>>  > > > > aggregated. I
>>  > > > > > >> can
>>  > > > > > >> > > also
>>  > > > > > >> > > > > > >> envision
>>  > > > > > >> > > > > > >> > > scenarios where I would want to index data
>>  > based
>>  > > on
>>  > > > > > >> > attributes
>>  > > > > > >> > > > > other
>>  > > > > > >> > > > > > >> than
>>  > > > > > >> > > > > > >> > > sensor, business unit or subsidiary for
>>  > example.
>>  > > > > > >> > > > > > >> > >
>>  > > > > > >> > > > > > >> > > I've been wanting to propose extending our 7
>>  > > > standard
>>  > > > > > >> fields
>>  > > > > > >> > to
>>  > > > > > >> > > > > > include
>>  > > > > > >> > > > > > >> > > things like URL and user. Is there community
>>  > > > > > >> > interest/support
>>  > > > > > >> > > > for
>>  > > > > > >> > > > > > >> moving
>>  > > > > > >> > > > > > >> > in
>>  > > > > > >> > > > > > >> > > that direction? If so, I'll start a new
>>  thread.
>>  > > > > > >> > > > > > >> > >
>>  > > > > > >> > > > > > >> > > Thanks!
>>  > > > > > >> > > > > > >> > >
>>  > > > > > >> > > > > > >> > > -Kyle
>>  > > > > > >> > > > > > >> > >
>>  > > > > > >> > > > > > >> > > On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <
>>  > > > > > >> > mattf@apache.org
>>  > > > > > >> > > >
>>  > > > > > >> > > > > > wrote:
>>  > > > > > >> > > > > > >> > >
>>  > > > > > >> > > > > > >> > > > Ah, I see. If overriding the default index
>>  > name
>>  > > > > > allows
>>  > > > > > >> > > using
>>  > > > > > >> > > > > the
>>  > > > > > >> > > > > > >> same
>>  > > > > > >> > > > > > >> > > > name for multiple sensors, then the goal
>>  can
>>  > be
>>  > > > > > >> achieved.
>>  > > > > > >> > > > > > >> > > > Thanks,
>>  > > > > > >> > > > > > >> > > > --Matt
>>  > > > > > >> > > > > > >> > > >
>>  > > > > > >> > > > > > >> > > >
>>  > > > > > >> > > > > > >> > > > On 1/12/17, 3:30 PM, "Casey Stella" <
>>  > > > > > >> cestella@gmail.com>
>>  > > > > > >> > > > wrote:
>>  > > > > > >> > > > > > >> > > >
>>  > > > > > >> > > > > > >> > > > Oh, you could! Let's say you have a syslog
>>  > > parser
>>  > > > > > >> > with
>>  > > > > > >> > > > data
>>  > > > > > >> > > > > > from
>>  > > > > > >> > > > > > >> > > > sources 1
>>  > > > > > >> > > > > > >> > > > 2 and 3. You'd end up with one Kafka queue
>>  > > with 3
>>  > > > > > >> > > parsers
>>  > > > > > >> > > > > > >> attached
>>  > > > > > >> > > > > > >> > > to
>>  > > > > > >> > > > > > >> > > > that
>>  > > > > > >> > > > > > >> > > > queue, each picking apart the messages from
>>  > > source
>>  > > > > > >> 1, 2
>>  > > > > > >> > > and
>>  > > > > > >> > > > > 3.
>>  > > > > > >> > > > > > >> > They'd
>>  > > > > > >> > > > > > >> > > > go
>>  > > > > > >> > > > > > >> > > > through separate enrichment and into the
>>  > > indexing
>>  > > > > > >> > > > topology.
>>  > > > > > >> > > > > > In
>>  > > > > > >> > > > > > >> the
>>  > > > > > >> > > > > > >> > > > indexing topology, you could specify the
>>  same
>>  > > > index
>>  > > > > > >> > name
>>  > > > > > >> > > > > > "syslog"
>>  > > > > > >> > > > > > >> > and
>>  > > > > > >> > > > > > >> > > > all
>>  > > > > > >> > > > > > >> > > > of the messages go into the same index for
>>  > CEP
>>  > > > > > >> > querying
>>  > > > > > >> > > if
>>  > > > > > >> > > > > so
>>  > > > > > >> > > > > > >> > > desired.
>>  > > > > > >> > > > > > >> > > >
>>  > > > > > >> > > > > > >> > > > On Thu, Jan 12, 2017 at 6:27 PM, Matt
>>  Foley <
>>  > > > > > >> > > > > mattf@apache.org
>>  > > > > > >> > > > > > >
>>  > > > > > >> > > > > > >> > > wrote:
>>  > > > > > >> > > > > > >> > > >
>>  > > > > > >> > > > > > >> > > > > Syslog is hell on parsers – I know, I
>>  > worked
>>  > > at
>>  > > > > > >> > > LogLogic
>>  > > > > > >> > > > > in
>>  > > > > > >> > > > > > a
>>  > > > > > >> > > > > > >> > > > previous
>>  > > > > > >> > > > > > >> > > > > life. It makes perfect sense to route
>>  > > different
>>  > > > > > >> > lines
>>  > > > > > >> > > > > from
>>  > > > > > >> > > > > > >> > syslog
>>  > > > > > >> > > > > > >> > > > through
>>  > > > > > >> > > > > > >> > > > > different appropriate parsers. But a lot
>>  of
>>  > > > what
>>  > > > > > >> > the
>>  > > > > > >> > > > > > parsers
>>  > > > > > >> > > > > > >> do
>>  > > > > > >> > > > > > >> > is
>>  > > > > > >> > > > > > >> > > > > identify consistent subsets of metadata
>>  and
>>  > > > > > >> annotate
>>  > > > > > >> > > it
>>  > > > > > >> > > > –
>>  > > > > > >> > > > > > eg,
>>  > > > > > >> > > > > > >> > > > src_ip_addr,
>>  > > > > > >> > > > > > >> > > > > event timestamps, etc. Once those
>>  metadata
>>  > > are
>>  > > > > > >> > > > annotated
>>  > > > > > >> > > > > > and
>>  > > > > > >> > > > > > >> > > > available
>>  > > > > > >> > > > > > >> > > > > with common field names, why doesn’t it
>>  > make
>>  > > > > > >> sense
>>  > > > > > >> > to
>>  > > > > > >> > > > > index
>>  > > > > > >> > > > > > the
>>  > > > > > >> > > > > > >> > > > messages
>>  > > > > > >> > > > > > >> > > > > together, for CEP querying? I think
>>  Splunk
>>  > > has
>>  > > > > > >> > > > > illustrated
>>  > > > > > >> > > > > > >> this
>>  > > > > > >> > > > > > >> > > > model.
>>  > > > > > >> > > > > > >> > > > >
>>  > > > > > >> > > > > > >> > > > > On 1/12/17, 3:00 PM, "Casey Stella" <
>>  > > > > > >> > > cestella@gmail.com
>>  > > > > > >> > > > >
>>  > > > > > >> > > > > > >> wrote:
>>  > > > > > >> > > > > > >> > > > >
>>  > > > > > >> > > > > > >> > > > > yeah, I mean, honestly, I think the approach that we've taken for sources
>>  > > > > > >> > > > > > >> > > > > which aggregate different types of data is to provide filters at the parser
>>  > > > > > >> > > > > > >> > > > > level and have multiple parser topologies (with different, possibly mutually
>>  > > > > > >> > > > > > >> > > > > exclusive filters) running. This would be a completely separate sensor.
>>  > > > > > >> > > > > > >> > > > > Imagine a syslog data source that aggregates and you want to pick apart
>>  > > > > > >> > > > > > >> > > > > certain pieces of messages. This is why the initial thought and architecture
>>  > > > > > >> > > > > > >> > > > > was one index per sensor.
>>  > > > > > >> > > > > > >> > > > >
>>  > > > > > >> > > > > > >> > > > > On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley <mattf@apache.org> wrote:
>>  > > > > > >> > > > > > >> > > > >
>>  > > > > > >> > > > > > >> > > > > > I’m thinking that CEP (Complex Event Processing) is contrary to the idea
>>  > > > > > >> > > > > > >> > > > > > of silo-ing data per sensor.
>>  > > > > > >> > > > > > >> > > > > > Now it’s true that some of those sensors are already aggregating data from
>>  > > > > > >> > > > > > >> > > > > > multiple sources, so maybe I’m wrong here.
>>  > > > > > >> > > > > > >> > > > > > But it just seems to me that the “data lake” insights come from being able
>>  > > > > > >> > > > > > >> > > > > > to make decisions over the whole mass of data rather than just vertical
>>  > > > > > >> > > > > > >> > > > > > slices of it.
>>  > > > > > >> > > > > > >> > > > > >
>>  > > > > > >> > > > > > >> > > > > > On 1/12/17, 2:15 PM, "Casey Stella" <cestella@gmail.com> wrote:
>>  > > > > > >> > > > > > >> > > > > >
>>  > > > > > >> > > > > > >> > > > > > Hey Matt,
>>  > > > > > >> > > > > > >> > > > > >
>>  > > > > > >> > > > > > >> > > > > > Thanks for the comment!
>>  > > > > > >> > > > > > >> > > > > > 1. At the moment, we only have one index name, the default of which is the
>>  > > > > > >> > > > > > >> > > > > > sensor name but that's entirely up to the user. This is sensor specific,
>>  > > > > > >> > > > > > >> > > > > > so it'd be a separate config for each sensor. If we want to build multiple
>>  > > > > > >> > > > > > >> > > > > > indices per sensor, we'd have to think carefully about how to do that and
>>  > > > > > >> > > > > > >> > > > > > would be a bigger undertaking. I guess I can see the use, though (redirect
>>  > > > > > >> > > > > > >> > > > > > messages to one index vs another based on a predicate for a given sensor).
>>  > > > > > >> > > > > > >> > > > > > Anyway, not where I was originally thinking that this discussion would go,
>>  > > > > > >> > > > > > >> > > > > > but it's an interesting point.
>>  > > > > > >> > > > > > >> > > > > >
>>  > > > > > >> > > > > > >> > > > > > 2. I hadn't thought through the implementation quite yet, but we don't
>>  > > > > > >> > > > > > >> > > > > > actually have a splitter bolt in that topology, just a spout that goes to
>>  > > > > > >> > > > > > >> > > > > > the elasticsearch writer and also to the hdfs writer.
>>  > > > > > >> > > > > > >> > > > > >
>>  > > > > > >> > > > > > >> > > > > > On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley <mattf@apache.org> wrote:
>>  > > > > > >> > > > > > >> > > > > >
>>  > > > > > >> > > > > > >> > > > > > > Casey, good to have controls like this. Couple questions:
>>  > > > > > >> > > > > > >> > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > 1. Regarding the “index” : “squid” name/value pair, is the index name
>>  > > > > > >> > > > > > >> > > > > > > expected to always be a sensor name? Or is the given json structure
>>  > > > > > >> > > > > > >> > > > > > > subordinate to a sensor name in zookeeper? Or can we build arbitrary
>>  > > > > > >> > > > > > >> > > > > > > indexes with this new specification, independent of sensor? Should there
>>  > > > > > >> > > > > > >> > > > > > > actually be a list of “indexes”, ie
>>  > > > > > >> > > > > > >> > > > > > > { “indexes” : [
>>  > > > > > >> > > > > > >> > > > > > > {“index” : “name1”,
>>  > > > > > >> > > > > > >> > > > > > > …
>>  > > > > > >> > > > > > >> > > > > > > },
>>  > > > > > >> > > > > > >> > > > > > > {“index” : “name2”,
>>  > > > > > >> > > > > > >> > > > > > > …
>>  > > > > > >> > > > > > >> > > > > > > } ]
>>  > > > > > >> > > > > > >> > > > > > > }
>>  > > > > > >> > > > > > >> > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > 2. Would the filtering / writer selection logic take place in the
>>  > > > > > >> > > > > > >> > > > > > > indexing topology splitter bolt? Seems like that would have the smallest
>>  > > > > > >> > > > > > >> > > > > > > impact on current implementation, no?
>>  > > > > > >> > > > > > >> > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > Sorry if these are already answered in PR-415, I haven’t had time to
>>  > > > > > >> > > > > > >> > > > > > > review that one yet.
>>  > > > > > >> > > > > > >> > > > > > > Thanks,
>>  > > > > > >> > > > > > >> > > > > > > --Matt
>>  > > > > > >> > > > > > >> > > > > > >
>>  > > > > > >> > > > > > >> > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > On 1/12/17, 12:55 PM, "Michael Miklavcic" <michael.miklavcic@gmail.com> wrote:
>>  > > > > > >> > > > > > >> > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > I like the flexibility and expressibility of the first option with
>>  > > > > > >> > > > > > >> > > > > > > Stellar filters.
>>  > > > > > >> > > > > > >> > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > M
>>  > > > > > >> > > > > > >> > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <cestella@gmail.com> wrote:
>>  > > > > > >> > > > > > >> > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > > As of METRON-652 <https://github.com/apache/incubator-metron/pull/415>,
>>  > > > > > >> > > > > > >> > > > > > > > we will have decoupled the indexing configuration from the enrichment
>>  > > > > > >> > > > > > >> > > > > > > > configuration. As an immediate follow-up to that, I'd like to provide
>>  > > > > > >> > > > > > >> > > > > > > > the ability to turn off and on writers via the configs. I'd like to get
>>  > > > > > >> > > > > > >> > > > > > > > some community feedback on how the functionality should work, if y'all
>>  > > > > > >> > > > > > >> > > > > > > > are amenable. :)
>>  > > > > > >> > > > > > >> > > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > > As of now, we have 3 possible writers which can be used in the indexing
>>  > > > > > >> > > > > > >> > > > > > > > topology:
>>  > > > > > >> > > > > > >> > > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > > - Solr
>>  > > > > > >> > > > > > >> > > > > > > > - Elasticsearch
>>  > > > > > >> > > > > > >> > > > > > > > - HDFS
>>  > > > > > >> > > > > > >> > > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > > HDFS is always used, elasticsearch or solr is used depending on how you
>>  > > > > > >> > > > > > >> > > > > > > > start the indexing topology.
>>  > > > > > >> > > > > > >> > > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > > A couple of proposals come to mind immediately:
>>  > > > > > >> > > > > > >> > > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > > *Index Filtering*
>>  > > > > > >> > > > > > >> > > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > > You would be able to specify a filter as defined by a stellar statement
>>  > > > > > >> > > > > > >> > > > > > > > (likely a reuse of the StellarFilter that exists in the Parsers) which
>>  > > > > > >> > > > > > >> > > > > > > > would allow you to indicate on a message-by-message basis whether or
>>  > > > > > >> > > > > > >> > > > > > > > not to write the message.
>>  > > > > > >> > > > > > >> > > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > > The semantics of this would be as follows:
>>  > > > > > >> > > > > > >> > > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > > - Default (i.e. unspecified) is to pass everything through (hence
>>  > > > > > >> > > > > > >> > > > > > > >   backwards compatible with the current default config).
>>  > > > > > >> > > > > > >> > > > > > > > - Messages which have the associated stellar statement evaluate to true
>>  > > > > > >> > > > > > >> > > > > > > >   for the writer type will be written, otherwise not.
>>  > > > > > >> > > > > > >> > > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > > Sample indexing config which would write out no messages to HDFS and
>>  > > > > > >> > > > > > >> > > > > > > > write out only messages containing a field called "field1":
>>  > > > > > >> > > > > > >> > > > > > > > {
>>  > > > > > >> > > > > > >> > > > > > > >   "index" : "squid"
>>  > > > > > >> > > > > > >> > > > > > > >  ,"batchSize" : 100
>>  > > > > > >> > > > > > >> > > > > > > >  ,"filters" : {
>>  > > > > > >> > > > > > >> > > > > > > >     "HDFS" : "false"
>>  > > > > > >> > > > > > >> > > > > > > >    ,"ES" : "exists(field1)"
>>  > > > > > >> > > > > > >> > > > > > > >   }
>>  > > > > > >> > > > > > >> > > > > > > > }
>>  > > > > > >> > > > > > >> > > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > > *Index On/Off Switch*
>>  > > > > > >> > > > > > >> > > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > > A simpler solution would be to just provide a list of writers to write
>>  > > > > > >> > > > > > >> > > > > > > > messages. The semantics would be as follows:
>>  > > > > > >> > > > > > >> > > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > > - If the list is unspecified, then the default is to write all messages
>>  > > > > > >> > > > > > >> > > > > > > >   for every writer in the indexing topology
>>  > > > > > >> > > > > > >> > > > > > > > - If the list is specified, then a writer will write all messages if
>>  > > > > > >> > > > > > >> > > > > > > >   and only if it is named in the list.
>>  > > > > > >> > > > > > >> > > > > > > >
>>  > > > > > >> > > > > > >> > > > > > > > Sample indexing config which turns off HDFS and keeps on Elasticsearch:
>>  > > > > > >> > > > > > >> > > > > > > > {
>>  > > > > > >> > > > > > >> > > > > > > > "index" : "squid"
>>  > > > > > >> > > > > > >> > > > > > > > ,"batchSize" : 100
>>  > > > > > >> > > > > > >> > > > > > > > ,"writers" : [ "ES" ]
>>  > >
>>  > > --
>>  >
>>  > Jon
>>  >
>>  > Sent from my mobile device
>>  >
>
> --
> Nick Allen <nick@nickallen.org>

------------------- 
Thank you,

James Sirota
PPMC- Apache Metron (Incubating)
jsirota AT apache DOT org
