metron-dev mailing list archives

From Kyle Richardson <kylerichards...@gmail.com>
Subject Re: [DISCUSS] Turning off indexing writers feature discussion
Date Sat, 14 Jan 2017 17:13:35 GMT
I'm +1 on the current proposal. I like Nick's syntax and agree with Jon's enabled property. I also like the idea of a path property for HDFS.

-Kyle
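
For anyone skimming the thread, here is a minimal sketch of how the semantics being agreed on might resolve per writer. It is not Metron code: the function name, the `KNOWN_WRITERS` tuple and the fallback of the HDFS path to the sensor name are assumptions made for illustration. The on-by-default behaviour, the default batch size of 1 and the index defaulting to the sensor name come from the proposals quoted below.

```python
# Illustrative only: names here are hypothetical, not part of Metron.
KNOWN_WRITERS = ("elasticsearch", "hdfs", "solr")
DEFAULT_BATCH_SIZE = 1  # Casey's base case: batchSize of 1 for all writers


def as_bool(value, default=True):
    """Accept real booleans or the quoted 'true'/'false' strings used in the thread."""
    if value is None:
        return default  # unspecified means on, matching current semantics
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == "true"


def resolve_writer_config(sensor, indexing_config, writer):
    """Merge one writer's settings with the defaults discussed in the thread."""
    writer_config = indexing_config.get(writer, {})
    resolved = {
        "enabled": as_bool(writer_config.get("enabled")),
        "batchSize": int(writer_config.get("batchSize", DEFAULT_BATCH_SIZE)),
    }
    # HDFS takes a 'path', the other writers take an 'index'.
    # Falling back to the sensor name for 'path' is an assumption.
    key = "path" if writer == "hdfs" else "index"
    resolved[key] = writer_config.get(key, sensor)
    return resolved


# Hypothetical config in the shape Jon proposed: HDFS is switched off,
# but its settings stay in place for when it is re-enabled.
EXAMPLE_CONFIG = {
    "elasticsearch": {"enabled": "true", "index": "foo", "batchSize": 100},
    "hdfs": {"enabled": "false", "path": "/tmp/example", "batchSize": 100},
}

if __name__ == "__main__":
    for name in KNOWN_WRITERS:
        print(name, resolve_writer_config("squid", EXAMPLE_CONFIG, name))
```

Note that a writer with no entry at all (solr in the example) resolves to enabled, which is the on-by-default behaviour Casey asked for.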

> On Jan 14, 2017, at 10:51 AM, Casey Stella <cestella@gmail.com> wrote:
> 
> I'm +1 on an explicit enabled property and a filter (or when) property. I
> think we are zeroing in on a decent design, so that is good.
> 
> To recap, what I am +1 on is Nick's proposed syntax with the following
> modifications:
> 1. An explicit enabled field
> 2. A default on for unspecified to match current semantics
> 
> Casey
>> On Sat, Jan 14, 2017 at 10:45 Zeolla@GMail.com <zeolla@gmail.com> wrote:
>> 
>> This has the additional benefit of doing something like below when you want
>> to temporarily disable the hdfs writer, but don't want to remove the
>> settings.  This removes the need to store the path and batchSize (and many
>> additional settings) somewhere else so they can be brought back in when you
>> want to re-enable it, which is a nice workflow attribute for the end user:
>> 
>> {
>>   'elasticsearch': {
>>      'enabled': 'true',
>>      'index': 'foo',
>>      'batchSize': 100
>>    },
>>   'hdfs': {
>>      'enabled': 'false',
>>      'path': '/foo/bar/...',
>>      'batchSize': 100
>>    },
>>   'solr': {
>>      'enabled': 'false'
>>    }
>> }
>> 
>> Jon
>> 
>>> On Sat, Jan 14, 2017 at 9:24 AM Zeolla@GMail.com <zeolla@gmail.com> wrote:
>>> 
>>> I similarly have a concern there because I prefer being as explicit as
>>> possible, which makes things easier to pick up for new users.  Using my
>>> example from earlier this could look like specifying when(false), but an
>>> even better and more obvious approach may be to use enabled(false).  So the
>>> current simple default would be:
>>> 
>>> {
>>>   'elasticsearch': { 'enabled': 'true' },
>>>   'hdfs': { 'enabled': 'true' },
>>>   'solr': { 'enabled': 'false' }
>>> }
>>> 
>>> And to use ES with some overrides but not HDFS or Solr it would look like:
>>> 
>>> {
>>>   'elasticsearch': {
>>>      'enabled': 'true',
>>>      'index': 'foo',
>>>      'batchSize': 100
>>>    },
>>>   'hdfs': {
>>>      'enabled': 'false'
>>>    },
>>>   'solr': {
>>>      'enabled': 'false'
>>>    }
>>> }
>>> 
>>> Jon
>>> 
>>> On Fri, Jan 13, 2017 at 10:21 PM Casey Stella <cestella@gmail.com> wrote:
>>> 
>>> One thing that I thought of that I very strenuously do not like in Nick's
>>> proposal is that if a writer config is not specified then it is turned off
>>> (I think; if I misunderstood let me know). In the situation where we have a
>>> new sensor, right now if there is no index config and no enrichment
>>> config, it still passes through to the index using defaults. In this new
>>> scheme it would not. This changes the default semantics for the system and
>>> I think it changes it for the worse.
>>> 
>>> I would strongly prefer an on-by-default indexing config as we have now.
>>>> On Fri, Jan 13, 2017 at 17:13 Casey Stella <cestella@gmail.com> wrote:
>>>> 
>>>> One thing that I really like about Nick's suggestion is that it allows
>>>> writer-specific configs in a clear and simple way.  It is more complex for
>>>> the default case (all writers write to indices named the same thing with a
>>>> fixed batch size), which I do not like, but maybe it's worth the compromise
>>>> to make it less complex for the advanced case.
>>>> 
>>>> Thanks a lot for the suggestion, Nick, it's interesting;  I'm beginning to
>>>> lean your way.
>>>> 
>>>> On Fri, Jan 13, 2017 at 2:51 PM, Zeolla@GMail.com <zeolla@gmail.com>
>>>> wrote:
>>>> 
>>>> I like the suggestions you made, Nick.  The only thing I would add is that
>>>> it's also nice to see an explicit when(false), as people newer to the
>>>> platform may not know where to expect configs for the different writers.
>>>> Being able to do it either way, which I think is already assumed in your
>>>> model, would make sense.  I would just suggest that, if we support but are
>>>> disabling a writer, that the platform inserts a default when(false) to be
>>>> explicit.
>>>> 
>>>> Jon
>>>> 
>>>> On Fri, Jan 13, 2017 at 11:59 AM Casey Stella <cestella@gmail.com> wrote:
>>>> 
>>>>> Let me noodle on this over the weekend.  Your syntax is looking less
>>>>> onerous to me and I like the following statement from Otto: "In the end,
>>>>> each write destination ‘type’ will need its own configuration.  This is an
>>>>> extension point."
>>>>> 
>>>>> I may come around to your way of thinking.
>>>>> 
>>>>> On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler <ottobackwards@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> In the end, each write destination ‘type’ will need its own
>>>>>> configuration.  This is an extension point.
>>>>>> {
>>>>>> HDFS:{
>>>>>> outputAdapters:[
>>>>>> {name: avro,
>>>>>> settings:{
>>>>>> avro stuff….
>>>>>> when:{
>>>>>> },
>>>>>> {
>>>>>> name: sequence file,
>>>>>> …..
>>>>>> 
>>>>>> or some such.
>>>>>> 
>>>>>> 
>>>>>> On January 13, 2017 at 11:51:15, Nick Allen (nick@nickallen.org) wrote:
>>>>>> 
>>>>>> I will add also that instead of global overrides, like index, we should use
>>>>>> configuration key names that are more appropriate to the output.
>>>>>> 
>>>>>> For example, does 'index' really make sense for HDFS? Or would 'path' be
>>>>>> more appropriate?
>>>>>> 
>>>>>> {
>>>>>> 'elasticsearch': {
>>>>>> 'index': 'foo',
>>>>>> 'batchSize': 1
>>>>>> },
>>>>>> 'hdfs': {
>>>>>> 'path': '/foo/bar/...',
>>>>>> 'batchSize': 100
>>>>>> }
>>>>>> }
>>>>>> 
>>>>>> Ok, I've said my piece. Thanks for the effort in summarizing all this,
>>>>>> Casey.
>>>>>> 
>>>>>> 
>>>>>> On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen <nick@nickallen.org> wrote:
>>>>>> 
>>>>>>>> Nick's concerns about my suggestion were that it was overly complex and
>>>>>>>> hard to grok and that we could dispense with backwards compatibility and
>>>>>>>> make people do a bit more work on the default case for the benefits of a
>>>>>>>> simpler advanced case. (Nick, make sure I don't misstate your position)
>>>>>>> 
>>>>>>> 
>>>>>>> I will add that in my mind, the majority case would be a user
>>>>>>> specifying the outputs, but not things like 'batchSize' or 'when'. I think
>>>>>>> in the majority case, the user would accept whatever the default batch size
>>>>>>> is.
>>>>>>> 
>>>>>>> Here are alternative suggestions for all the examples that you provided
>>>>>>> previously.
>>>>>>> 
>>>>>>> Base Case
>>>>>>> 
>>>>>>> - The user must always specify the 'outputs' for clarity.
>>>>>>> - Uses default index name, batch size and when = true.
>>>>>>> 
>>>>>>> {
>>>>>>> 'elasticsearch': {},
>>>>>>> 'hdfs': {}
>>>>>>> }
>>>>>>> 
>>>>>>> 
>>>>>>> Writer-non-specific Case
>>>>>>> <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-non-specific-case>
>>>>>>> 
>>>>>>> - There are no global overrides, as in Casey's proposal.
>>>>>>> - Easier to grok IMO.
>>>>>>> 
>>>>>>> {
>>>>>>> 'elasticsearch': {
>>>>>>> 'index': 'foo',
>>>>>>> 'batchSize': 100
>>>>>>> },
>>>>>>> 'hdfs': {
>>>>>>> 'index': 'foo',
>>>>>>> 'batchSize': 100
>>>>>>> }
>>>>>>> }
>>>>>>> 
>>>>>>> 
>>>>>>> Writer-specific case without filters
>>>>>>> <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-without-filters>
>>>>>>> 
>>>>>>> {
>>>>>>> 'elasticsearch': {
>>>>>>> 'index': 'foo',
>>>>>>> 'batchSize': 1
>>>>>>> },
>>>>>>> 'hdfs': {
>>>>>>> 'index': 'foo',
>>>>>>> 'batchSize': 100
>>>>>>> }
>>>>>>> }
>>>>>>> 
>>>>>>> 
>>>>>>> Writer-specific case with filters
>>>>>>> <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-with-filters>
>>>>>>> 
>>>>>>> - Instead of having to say when=false, just don't configure HDFS
>>>>>>> 
>>>>>>> {
>>>>>>> 'elasticsearch': {
>>>>>>> 'index': 'foo',
>>>>>>> 'batchSize': 100,
>>>>>>> 'when': 'exists(field1)'
>>>>>>> }
>>>>>>> }
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella <cestella@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Dave,
>>>>>>>> For the benefit of posterity and people who might not be as deeply
>>>>>>>> entangled in the system as we have been, I'll recap things and hopefully
>>>>>>>> answer your question in the process.
>>>>>>>> 
>>>>>>>> Historically the index configuration is split between the enrichment
>>>>>>>> configs and the global configs.
>>>>>>>> 
>>>>>>>> - The global configs really control configs that apply to all sensors.
>>>>>>>> Historically this has been stuff like index connection strings, etc.
>>>>>>>> - The sensor-specific configs which control things that vary by sensor.
>>>>>>>> 
>>>>>>>> As of METRON-652 (in review currently), we moved the sensor-specific
>>>>>>>> configs from the enrichment configs. The proposal here is to increase the
>>>>>>>> granularity of the sensor-specific files to make them support index
>>>>>>>> writer-specific configs. Right now in the indexing topology, we have 2
>>>>>>>> writers (fixed): ES/Solr and HDFS.
>>>>>>>> 
>>>>>>>> The proposed configuration would allow you to either specify a blanket
>>>>>>>> sensor-level config for the index name and batchSize and/or override at
>>>>>>>> the writer level, thereby supporting a couple of use-cases:
>>>>>>>> 
>>>>>>>> - Turning off certain index writers (e.g. HDFS)
>>>>>>>> - Filtering the messages written to certain index writers
>>>>>>>> 
>>>>>>>> The two competing configs between Nick and me are as follows:
>>>>>>>> 
>>>>>>>> - I want to make sure we keep the old sensor-specific defaults with
>>>>>>>> writer-specific overrides available
>>>>>>>> - Nick thought we could simplify the permutations by making the indexing
>>>>>>>> config only the writer-level configs.
>>>>>>>> 
>>>>>>>> My concerns about Nick's suggestion were that the default and majority
>>>>>>>> case, specifying the index and the batchSize for all writers (the one we
>>>>>>>> support now) would require more configuration.
>>>>>>>> 
>>>>>>>> Nick's concerns about my suggestion were that it was overly complex and
>>>>>>>> hard to grok and that we could dispense with backwards compatibility and
>>>>>>>> make people do a bit more work on the default case for the benefits of a
>>>>>>>> simpler advanced case. (Nick, make sure I don't misstate your position).
>>>>>>>> 
>>>>>>>> Casey
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, Jan 13, 2017 at 10:54 AM, David Lyle <dlyle65535@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Casey,
>>>>>>>>> 
>>>>>>>>> Can you give me a level set of what your thinking is now? I think it's
>>>>>>>>> global control of all index types + overrides on a per-type basis. Fwiw,
>>>>>>>>> I'm totally for that, but I want to make sure I'm not imposing my
>>>>>>>>> preconceived notions on your consensus-driven ones.
>>>>>>>>> 
>>>>>>>>> -D....
>>>>>>>>> 
>>>>>>>>> On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella <cestella@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> I am suggesting that, yes. The configs are essentially the same as yours,
>>>>>>>>>> except there is an override specified at the top level. Without that, in
>>>>>>>>>> order to specify that both HDFS and ES have batch sizes of 100, you have to
>>>>>>>>>> explicitly configure each. It's less that I'm trying to have backwards
>>>>>>>>>> compatibility and more that I'm trying to make the majority case easy: both
>>>>>>>>>> writers write everything to a specified index name with a specified batch
>>>>>>>>>> size (which is what we have now). Beyond that, I want to allow for
>>>>>>>>>> specifying an override for the config on a writer-by-writer basis for those
>>>>>>>>>> who need it.
>>>>>>>>>> 
>>>>>>>>>> On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen <nick@nickallen.org>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Are you saying we support all of these variants? I realize you are trying
>>>>>>>>>>> to have some backwards compatibility, but this also makes it harder for a
>>>>>>>>>>> user to grok (for me at least).
>>>>>>>>>>> 
>>>>>>>>>>> Personally I like my original example as there are fewer sub-structures,
>>>>>>>>>>> like 'writerConfig', which makes the whole thing simpler and easier to
>>>>>>>>>>> grok. But maybe others will think your proposal is just as easy to grok.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella <cestella@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Ok, so here's what I'm thinking based on the discussion:
>>>>>>>>>>>> 
>>>>>>>>>>>> - Keeping the configs that we have now (batchSize and
>>> index)
>>>>> as
>>>>>>>>>>> defaults
>>>>>>>>>>>> for the unspecified writer-specific case
>>>>>>>>>>>> - Adding the config Nick suggested
>>>>>>>>>>>> 
>>>>>>>>>>>> *Base Case*:
>>>>>>>>>>>> {
>>>>>>>>>>>> }
>>>>>>>>>>>> 
>>>>>>>>>>>> - all writers write all messages
>>>>>>>>>>>> - index named the same as the sensor for all writers
>>>>>>>>>>>> - batchSize of 1 for all writers
>>>>>>>>>>>> 
>>>>>>>>>>>> *Writer-non-specific case*:
>>>>>>>>>>>> {
>>>>>>>>>>>> "index" : "foo"
>>>>>>>>>>>> ,"batchSize" : 100
>>>>>>>>>>>> }
>>>>>>>>>>>> 
>>>>>>>>>>>> - All writers write all messages
>>>>>>>>>>>> - index is named "foo", different from the sensor for
>> all
>>>>>>>> writers
>>>>>>>>>>>> - batchSize is 100 for all writers
>>>>>>>>>>>> 
>>>>>>>>>>>> *Writer-specific case without filters*
>>>>>>>>>>>> {
>>>>>>>>>>>> "index" : "foo"
>>>>>>>>>>>> ,"batchSize" : 1
>>>>>>>>>>>> , "writerConfig" :
>>>>>>>>>>>> {
>>>>>>>>>>>> "elasticsearch" : {
>>>>>>>>>>>> "batchSize" : 100
>>>>>>>>>>>> }
>>>>>>>>>>>> }
>>>>>>>>>>>> }
>>>>>>>>>>>> 
>>>>>>>>>>>> - All writers write all messages
>>>>>>>>>>>> - index is named "foo", different from the sensor for
>> all
>>>>>>>> writers
>>>>>>>>>>>> - batchSize is 1 for HDFS and 100 for elasticsearch
>>> writers
>>>>>>>>>>>> - NOTE: I could override the index name too
>>>>>>>>>>>> 
>>>>>>>>>>>> *Writer-specific case with filters*
>>>>>>>>>>>> {
>>>>>>>>>>>> "index" : "foo"
>>>>>>>>>>>> ,"batchSize" : 1
>>>>>>>>>>>> , "writerConfig" :
>>>>>>>>>>>> {
>>>>>>>>>>>> "elasticsearch" : {
>>>>>>>>>>>> "batchSize" : 100,
>>>>>>>>>>>> "when" : "exists(field1)"
>>>>>>>>>>>> },
>>>>>>>>>>>> "hdfs" : {
>>>>>>>>>>>> "when" : "false"
>>>>>>>>>>>> }
>>>>>>>>>>>> }
>>>>>>>>>>>> }
>>>>>>>>>>>> 
>>>>>>>>>>>> - ES writer writes messages which have field1, HDFS
>>> doesn't
>>>>>>>>>>>> - index is named "foo", different from the sensor for
>> all
>>>>>>>> writers
>>>>>>>>>>>> - batchSize is 100 for elasticsearch writers
>>>>>>>>>>>> 
>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby <
>>>>>>>> cduby@hortonworks.com
>>>>>>>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> For larger installations you need to control what is
>>>> indexed
>>>>>> so
>>>>>>>> you
>>>>>>>>>>> don’t
>>>>>>>>>>>>> end up with a nasty Elasticsearch situation and so
>> you
>>>> can
>>>>>> mine
>>>>>>>>> the
>>>>>>>>>>> data
>>>>>>>>>>>>> later for reports and training ml models.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Carolyn
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 1/13/17, 9:40 AM, "Casey Stella" <
>> cestella@gmail.com
>>>> 
>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> OH that's a good idea!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen <
>>>>>>>> nick@nickallen.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I like the "Index Filtering" option based on the
>>>>>> flexibility
>>>>>>>>> that
>>>>>>>>>> it
>>>>>>>>>>>>>>> provides. Should each output (HDFS, ES, etc) have
>> its
>>>> own
>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>> settings? For example, aren't things like batching
>>>>> handled
>>>>>>>>>>> separately
>>>>>>>>>>>>> for
>>>>>>>>>>>>>>> HDFS versus Elasticsearch?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Something along the lines of...
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>> "hdfs" : {
>>>>>>>>>>>>>>> "when": "exists(field1)",
>>>>>>>>>>>>>>> "batchSize": 100
>>>>>>>>>>>>>>> },
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> "elasticsearch" : {
>>>>>>>>>>>>>>> "when": "true",
>>>>>>>>>>>>>>> "batchSize": 1000,
>>>>>>>>>>>>>>> "index": "squid"
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella <
>>>>>>>>> cestella@gmail.com
>>>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Yeah, I tend to like the first option too. Any
>>>>> opposition
>>>>>>>> to
>>>>>>>>>> that
>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>> anyone?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The points brought up are good ones and I think
>>> that
>>>> it
>>>>>>>> may be
>>>>>>>>>>>> worth a
>>>>>>>>>>>>>>>> broader discussion of the requirements of
>> indexing
>>>> in a
>>>>>>>>> separate
>>>>>>>>>>> dev
>>>>>>>>>>>>> list
>>>>>>>>>>>>>>>> thread. Maybe a list of desires with coherent
>>>> use-cases
>>>>>>>>>>> justifying
>>>>>>>>>>>>> them
>>>>>>>>>>>>>>> so
>>>>>>>>>>>>>>>> we can think about how this stuff should work and
>>>> where
>>>>>> the
>>>>>>>>>>> natural
>>>>>>>>>>>>>>>> extension points should be. After all, we need to
>>> toe
>>>>> the
>>>>>>>> line
>>>>>>>>>>>> between
>>>>>>>>>>>>>>>> engineering and overengineering for features
>> nobody
>>>>> will
>>>>>>>> want.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I'm not sure about the extensions to the standard
>>>>> fields.
>>>>>>>> I'm
>>>>>>>>>>> torn
>>>>>>>>>>>>>>> between
>>>>>>>>>>>>>>>> the notions that we should have no standard
>> fields
>>> vs
>>>>> we
>>>>>>>>> should
>>>>>>>>>>>> have a
>>>>>>>>>>>>>>>> boatload of standard fields (with most of them
>>>> empty).
>>>>> I
>>>>>>>>>> exchange
>>>>>>>>>>>>>>>> positions fairly regularly on that question. ;)
>> It
>>>> may
>>>>> be
>>>>>>>>>> worth a
>>>>>>>>>>>> dev
>>>>>>>>>>>>>>> list
>>>>>>>>>>>>>>>> discussion to lay out how you imagine an
>> extension
>>> of
>>>>>>>> standard
>>>>>>>>>>>> fields
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> how it might look as implemented in Metron.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Casey
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson
>> <
>>>>>>>>>>>>>>>> kylerichardson2@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I'll second my preference for the first
>> option. I
>>>>> think
>>>>>>>> the
>>>>>>>>>>>> ability
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>> Stellar filters to customize indexing would be
>> a
>>>> big
>>>>>> win.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I'm glad Matt brought up the point about data
>>> lake
>>>>> and
>>>>>>>> CEP.
>>>>>>>>> I
>>>>>>>>>>>> think
>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>> a really important use case that we need to
>>>> consider.
>>>>>>>> Take a
>>>>>>>>>>>> simple
>>>>>>>>>>>>>>>>> example... If I have data coming in from 3
>>>> different
>>>>>>>>> firewall
>>>>>>>>>>>>> vendors
>>>>>>>>>>>>>>>> and 2
>>>>>>>>>>>>>>>>> different web proxy/url filtering vendors and I
>>>> want
>>>>> to
>>>>>>>> be
>>>>>>>>>> able
>>>>>>>>>>> to
>>>>>>>>>>>>>>>> analyze
>>>>>>>>>>>>>>>>> that data set, I need the data to be indexed
>> all
>>>>>> together
>>>>>>>>>>> (likely
>>>>>>>>>>>> in
>>>>>>>>>>>>>>>> HDFS)
>>>>>>>>>>>>>>>>> and to have a normalized schema such that IP
>>>> address,
>>>>>>>> URL,
>>>>>>>>> and
>>>>>>>>>>>> user
>>>>>>>>>>>>>>> name
>>>>>>>>>>>>>>>>> (to take a few) can be easily queried and
>>>>> aggregated. I
>>>>>>>> can
>>>>>>>>>> also
>>>>>>>>>>>>>>> envision
>>>>>>>>>>>>>>>>> scenarios where I would want to index data
>> based
>>> on
>>>>>>>>> attributes
>>>>>>>>>>>> other
>>>>>>>>>>>>>>> than
>>>>>>>>>>>>>>>>> sensor, business unit or subsidiary for
>> example.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I've been wanting to propose extending our 7
>>>> standard
>>>>>>>> fields
>>>>>>>>> to
>>>>>>>>>>>>> include
>>>>>>>>>>>>>>>>> things like URL and user. Is there community
>>>>>>>>> interest/support
>>>>>>>>>>> for
>>>>>>>>>>>>>>> moving
>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> that direction? If so, I'll start a new thread.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -Kyle
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <
>>>>>>>>> mattf@apache.org
>>>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Ah, I see. If overriding the default index
>> name
>>>>>> allows
>>>>>>>>>> using
>>>>>>>>>>>> the
>>>>>>>>>>>>>>> same
>>>>>>>>>>>>>>>>>> name for multiple sensors, then the goal can
>> be
>>>>>>>> achieved.
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> --Matt
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 1/12/17, 3:30 PM, "Casey Stella" <
>>>>>>>> cestella@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Oh, you could! Let's say you have a syslog
>>> parser
>>>>>>>>> with
>>>>>>>>>>> data
>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>> sources 1
>>>>>>>>>>>>>>>>>> 2 and 3. You'd end up with one kafka queue
>>> with 3
>>>>>>>>>> parsers
>>>>>>>>>>>>>>> attached
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>> queue, each picking apart the messages from
>>> source
>>>>>>>> 1, 2
>>>>>>>>>> and
>>>>>>>>>>>> 3.
>>>>>>>>>>>>>>>> They'd
>>>>>>>>>>>>>>>>>> go
>>>>>>>>>>>>>>>>>> through separate enrichment and into the
>>> indexing
>>>>>>>>>>> topology.
>>>>>>>>>>>>> In
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> indexing topology, you could specify the same
>>>> index
>>>>>>>>> name
>>>>>>>>>>>>> "syslog"
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>> of the messages go into the same index for
>> CEP
>>>>>>>>> querying
>>>>>>>>>> if
>>>>>>>>>>>> so
>>>>>>>>>>>>>>>>> desired.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <
>>>>>>>>>>>> mattf@apache.org
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Syslog is hell on parsers – I know, I
>> worked
>>> at
>>>>>>>>>> LogLogic
>>>>>>>>>>>> in
>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>> previous
>>>>>>>>>>>>>>>>>>> life. It makes perfect sense to route
>>> different
>>>>>>>>> lines
>>>>>>>>>>>> from
>>>>>>>>>>>>>>>> syslog
>>>>>>>>>>>>>>>>>> through
>>>>>>>>>>>>>>>>>>> different appropriate parsers. But a lot of
>>>> what
>>>>>>>>> the
>>>>>>>>>>>>> parsers
>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> identify consistent subsets of metadata and
>>>>>>>> annotate
>>>>>>>>>> it
>>>>>>>>>>> –
>>>>>>>>>>>>> eg,
>>>>>>>>>>>>>>>>>> src_ip_addr,
>>>>>>>>>>>>>>>>>>> event timestamps, etc. Once those metadata
>>> are
>>>>>>>>>>> annotated
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>>>> with common field names, why doesn’t it
>> make
>>>>>>>> sense
>>>>>>>>> to
>>>>>>>>>>>> index
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> messages
>>>>>>>>>>>>>>>>>>> together, for CEP querying? I think Splunk
>>> has
>>>>>>>>>>>> illustrated
>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>> model.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On 1/12/17, 3:00 PM, "Casey Stella" <
>>>>>>>>>> cestella@gmail.com
>>>>>>>>>>>> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> yeah, I mean, honestly, I think the
>> approach
>>>>>>>>> that
>>>>>>>>>>>> we've
>>>>>>>>>>>>>>> taken
>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>> sources
>>>>>>>>>>>>>>>>>>> which aggregate different types of data is
>> to
>>>>>>>>>>> provide
>>>>>>>>>>>>>>> filters
>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> parser
>>>>>>>>>>>>>>>>>>> level and have multiple parser topologies
>>>>>>>> (with
>>>>>>>>>>>>> different,
>>>>>>>>>>>>>>>>>> possibly
>>>>>>>>>>>>>>>>>>> mutually exclusive filters) running. This
>>>>>>>> would
>>>>>>>>>> be
>>>>>>>>>>> a
>>>>>>>>>>>>>>>>> completely
>>>>>>>>>>>>>>>>>>> separate
>>>>>>>>>>>>>>>>>>> sensor. Imagine a syslog data source that
>>>>>>>>>>> aggregates
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>> want to
>>>>>>>>>>>>>>>>>>> pick
>>>>>>>>>>>>>>>>>>> apart certain pieces of messages. This is
>>>>>>>> why
>>>>>>>>> the
>>>>>>>>>>>>> initial
>>>>>>>>>>>>>>>>>> thought and
>>>>>>>>>>>>>>>>>>> architecture was one index per sensor.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 5:55 PM, Matt
>> Foley <
>>>>>>>>>>>>>>>> mattf@apache.org>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I’m thinking that CEP (Complex Event
>>>>>>>>> Processing)
>>>>>>>>>>> is
>>>>>>>>>>>>>>>> contrary
>>>>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>>>>>> idea
>>>>>>>>>>>>>>>>>>>> of silo-ing data per sensor.
>>>>>>>>>>>>>>>>>>>> Now it’s true that some of those sensors
>>>>>>>> are
>>>>>>>>>>> already
>>>>>>>>>>>>>>>>>> aggregating
>>>>>>>>>>>>>>>>>>> data from
>>>>>>>>>>>>>>>>>>>> multiple sources, so maybe I’m wrong
>> here.
>>>>>>>>>>>>>>>>>>>> But it just seems to me that the “data
>>>>>>>> lake”
>>>>>>>>>>>> insights
>>>>>>>>>>>>>>> come
>>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>> being able
>>>>>>>>>>>>>>>>>>>> to make decisions over the whole mass of
>>>>>>>> data
>>>>>>>>>>> rather
>>>>>>>>>>>>> than
>>>>>>>>>>>>>>>>> just
>>>>>>>>>>>>>>>>>>> vertical
>>>>>>>>>>>>>>>>>>>> slices of it.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On 1/12/17, 2:15 PM, "Casey Stella" <
>>>>>>>>>>>>> cestella@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hey Matt,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks for the comment!
>>>>>>>>>>>>>>>>>>>> 1. At the moment, we only have one
>>>>>>>> index
>>>>>>>>>> name,
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> default
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> which is
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> sensor name but that's entirely up to
>>>>>>>> the
>>>>>>>>>>> user.
>>>>>>>>>>>>> This
>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>> sensor
>>>>>>>>>>>>>>>>>>>> specific,
>>>>>>>>>>>>>>>>>>>> so it'd be a separate config for each
>>>>>>>>>> sensor.
>>>>>>>>>>>> If
>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> build
>>>>>>>>>>>>>>>>>>>> multiple
>>>>>>>>>>>>>>>>>>>> indices per sensor, we'd have to think
>>>>>>>>>>> carefully
>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>> how
>>>>>>>>>>>>>>>>>> to do
>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>> would be a bigger undertaking. I
>>>>>>>> guess I
>>>>>>>>>> can
>>>>>>>>>>>> see
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> use,
>>>>>>>>>>>>>>>>>> though
>>>>>>>>>>>>>>>>>>>> (redirect
>>>>>>>>>>>>>>>>>>>> messages to one index vs another based
>>>>>>>> on
>>>>>>>>> a
>>>>>>>>>>>>> predicate
>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>> a given
>>>>>>>>>>>>>>>>>>>> sensor).
>>>>>>>>>>>>>>>>>>>> Anyway, not where I was originally
>>>>>>>>> thinking
>>>>>>>>>>> that
>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>> discussion
>>>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>>> go,
>>>>>>>>>>>>>>>>>>>> but it's an interesting point.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 2. I hadn't thought through the
>>>>>>>>>> implementation
>>>>>>>>>>>>> quite
>>>>>>>>>>>>>>>> yet,
>>>>>>>>>>>>>>>>>> but we
>>>>>>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>>>>>>>> actually have a splitter bolt in that
>>>>>>>>>>> topology,
>>>>>>>>>>>>> just
>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>> spout
>>>>>>>>>>>>>>>>>>> that goes
>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>> the elasticsearch writer and also to
>>>>>>>> the
>>>>>>>>>> hdfs
>>>>>>>>>>>>> writer.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 4:52 PM, Matt
>>>>>>>>> Foley
>>>>>>>>>> <
>>>>>>>>>>>>>>>>>> mattf@apache.org>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Casey, good to have controls like
>>>>>>>> this.
>>>>>>>>>>>> Couple
>>>>>>>>>>>>>>>>>> questions:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 1. Regarding the “index” : “squid”
>>>>>>>>>>> name/value
>>>>>>>>>>>>> pair,
>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> index name
>>>>>>>>>>>>>>>>>>>>> expected to always be a sensor
>>>>>>>> name? Or
>>>>>>>>>> is
>>>>>>>>>>>> the
>>>>>>>>>>>>>>> given
>>>>>>>>>>>>>>>>>> json
>>>>>>>>>>>>>>>>>>> structure
>>>>>>>>>>>>>>>>>>>>> subordinate to a sensor name in
>>>>>>>>> zookeeper?
>>>>>>>>>>> Or
>>>>>>>>>>>>> can
>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>> build
>>>>>>>>>>>>>>>>>>> arbitrary
>>>>>>>>>>>>>>>>>>>>> indexes with this new specification,
>>>>>>>>>>>>> independent of
>>>>>>>>>>>>>>>>>> sensor?
>>>>>>>>>>>>>>>>>>> Should
>>>>>>>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>>>>>>>> actually be a list of “indexes”, ie
>>>>>>>>>>>>>>>>>>>>> { “indexes” : [
>>>>>>>>>>>>>>>>>>>>> {“index” : “name1”,
>>>>>>>>>>>>>>>>>>>>> …
>>>>>>>>>>>>>>>>>>>>> },
>>>>>>>>>>>>>>>>>>>>> {“index” : “name2”,
>>>>>>>>>>>>>>>>>>>>> …
>>>>>>>>>>>>>>>>>>>>> } ]
>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 2. Would the filtering / writer
>>>>>>>>> selection
>>>>>>>>>>>> logic
>>>>>>>>>>>>>>> take
>>>>>>>>>>>>>>>>>> place in
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> indexing
>>>>>>>>>>>>>>>>>>>>> topology splitter bolt? Seems like
>>>>>>>> that
>>>>>>>>>>> would
>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> smallest
>>>>>>>>>>>>>>>>>>>> impact on
>>>>>>>>>>>>>>>>>>>>> current implementation, no?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Sorry if these are already answered
>>>>>>>> in
>>>>>>>>>>>> PR-415, I
>>>>>>>>>>>>>>>>> haven’t
>>>>>>>>>>>>>>>>>> had
>>>>>>>>>>>>>>>>>>> time to
>>>>>>>>>>>>>>>>>>>>> review that one yet.
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> --Matt
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On 1/12/17, 12:55 PM, "Michael
>>>>>>>>> Miklavcic"
>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>> michael.miklavcic@gmail.com>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I like the flexibility and
>>>>>>>>>>> expressibility
>>>>>>>>>>>> of
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> first
>>>>>>>>>>>>>>>>>>> option
>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>> Stellar
>>>>>>>>>>>>>>>>>>>>> filters.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> M
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 1:51 PM,
>>>>>>>>> Casey
>>>>>>>>>>>>> Stella <
>>>>>>>>>>>>>>>>>>>> cestella@gmail.com>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> As of METRON-652 <
>>>>>>>>>>>>> https://github.com/apache/
>>>>>>>>>>>>>>>>>>>>> incubator-metron/pull/415>, we
>>>>>>>>>>>>>>>>>>>>>> will have decoupled the
>>>>>>>> indexing
>>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>>>>> from the
>>>>>>>>>>>>>>>>>>>> enrichment
>>>>>>>>>>>>>>>>>>>>>> configuration. As an immediate
>>>>>>>>>>>> follow-up
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> that,
>>>>>>>>>>>>>>>>>> I'd
>>>>>>>>>>>>>>>>>>> like to
>>>>>>>>>>>>>>>>>>>>> provide the
>>>>>>>>>>>>>>>>>>>>>> ability to turn off and on
>>>>>>>> writers
>>>>>>>>>> via
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> configs. I'd
>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>> to get
>>>>>>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>>>>> community feedback on how the
>>>>>>>>>>>>> functionality
>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>> work,
>>>>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>> y'all are
>>>>>>>>>>>>>>>>>>>>>> amenable. :)
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> As of now, we have 3 possible writers which can be used in the indexing
>>>>>>>>>>>>>>>>>>>>>> topology:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> - Solr
>>>>>>>>>>>>>>>>>>>>>> - Elasticsearch
>>>>>>>>>>>>>>>>>>>>>> - HDFS
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> HDFS is always used; Elasticsearch or Solr is used depending on how you
>>>>>>>>>>>>>>>>>>>>>> start the indexing topology.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> A couple of proposals come to mind immediately:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> *Index Filtering*
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> You would be able to specify a filter as defined by a stellar statement
>>>>>>>>>>>>>>>>>>>>>> (likely a reuse of the StellarFilter that exists in the Parsers) which
>>>>>>>>>>>>>>>>>>>>>> would allow you to indicate on a message-by-message basis whether or
>>>>>>>>>>>>>>>>>>>>>> not to write the message.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> The semantics of this would be as follows:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> - Default (i.e. unspecified) is to pass everything through (hence
>>>>>>>>>>>>>>>>>>>>>>   backwards compatible with the current default config).
>>>>>>>>>>>>>>>>>>>>>> - Messages which have the associated stellar statement evaluate to true
>>>>>>>>>>>>>>>>>>>>>>   for the writer type will be written, otherwise not.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Sample indexing config which would write out no messages to HDFS and
>>>>>>>>>>>>>>>>>>>>>> write out only messages containing a field called "field1":
>>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>>   "index" : "squid"
>>>>>>>>>>>>>>>>>>>>>>   ,"batchSize" : 100
>>>>>>>>>>>>>>>>>>>>>>   ,"filters" : {
>>>>>>>>>>>>>>>>>>>>>>     "HDFS" : "false"
>>>>>>>>>>>>>>>>>>>>>>     ,"ES" : "exists(field1)"
>>>>>>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>>>>>>> }
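
As a rough illustration of the filter semantics quoted above, here is a minimal Python sketch. It is not Metron code: the should_write helper is hypothetical, and plain Python predicates stand in for the Stellar statements "false" and "exists(field1)". Treating "exists" as "the message has that key" is an assumption made only for this example.

```python
from typing import Callable, Dict, Optional

Message = Dict[str, object]
Predicate = Callable[[Message], bool]


def should_write(
    writer: str,
    message: Message,
    filters: Optional[Dict[str, Predicate]] = None,
) -> bool:
    """Decide whether `writer` should write `message` (hypothetical helper).

    A writer with no filter passes everything through, which mirrors the
    backwards-compatible default described in the proposal.
    """
    if not filters or writer not in filters:
        return True
    return bool(filters[writer](message))


# Stand-ins for the sample config: "HDFS" : "false", "ES" : "exists(field1)".
# These lambdas are assumptions, not real Stellar evaluation.
sample_filters: Dict[str, Predicate] = {
    "HDFS": lambda msg: False,
    "ES": lambda msg: "field1" in msg,
}

if __name__ == "__main__":
    msg = {"field1": "value", "source.type": "squid"}
    for name in ("HDFS", "ES", "Solr"):
        print(name, should_write(name, msg, sample_filters))
```

With the sample filters, nothing is written to HDFS, Elasticsearch only receives messages that carry field1, and a writer with no entry (Solr here) keeps writing everything.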
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> *Index On/Off Switch*
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> A simpler solution would be to just provide a list of writers to write
>>>>>>>>>>>>>>>>>>>>>> messages. The semantics would be as follows:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> - If the list is unspecified, then the default is to write all messages
>>>>>>>>>>>>>>>>>>>>>>   for every writer in the indexing topology
>>>>>>>>>>>>>>>>>>>>>> - If the list is specified, then a writer will write all messages if and
>>>>>>>>>>>>>>>>>>>>>>   only if it is named in the list.
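
The writer-list semantics above can be sketched in a few lines of Python. This is only an illustration under assumptions: writer_enabled is a hypothetical helper, not part of Metron, and treating an explicitly empty list as "no writers enabled" is one reading of the "if and only if" rule.

```python
from typing import Iterable, Optional


def writer_enabled(writer: str, writers: Optional[Iterable[str]] = None) -> bool:
    """Return True if `writer` should write messages (hypothetical helper).

    - No list given (None): every writer in the topology writes everything.
    - List given: a writer writes if and only if it is named in the list.
      An empty list is assumed here to disable all writers.
    """
    if writers is None:
        return True
    return writer in set(writers)


if __name__ == "__main__":
    configured = ["ES"]  # the "writers" value from the quoted sample config
    for name in ("HDFS", "ES", "Solr"):
        print(name, writer_enabled(name, configured))
```

Compared with per-writer filters, this form can only switch a writer fully on or off; it cannot select individual messages.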
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Sample indexing config which turns off HDFS and keeps on Elasticsearch:
>>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>> "index" : "squid"
>>>>>>>>>>>>>>>>>>>>>> ,"batchSize" : 100
>>>>>>>>>>>>>>>>>>>>>> ,"writers" : [ "ES" ]
>>> 
>>> --
>> 
>> Jon
>> 
>> Sent from my mobile device
>> 

