metron-dev mailing list archives

From Casey Stella <ceste...@gmail.com>
Subject Re: [DISCUSS] Turning off indexing writers feature discussion
Date Fri, 13 Jan 2017 16:06:06 GMT
Dave,
For the benefit of posterity and people who might not be as deeply
entangled in the system as we have been, I'll recap things and hopefully
answer your question in the process.

Historically the index configuration has been split between the enrichment
configs and the global configs.

   - The global configs control settings that apply to all sensors.
   Historically this has been stuff like index connection strings, etc.
   - The sensor-specific configs (historically kept in the enrichment
   configs) control things that vary by sensor.

As of METRON-652 (in review currently), we moved the sensor-specific
configs out of the enrichment configs.  The proposal here is to increase the
granularity of the sensor-specific files to make them support index
writer-specific configs.  Right now in the indexing topology, we have 2
writers (fixed): ES/Solr and HDFS.

The proposed configuration would allow you to either specify a blanket
sensor-level config for the index name and batchSize and/or override at the
writer level, thereby supporting a couple of use-cases:

   - Turning off certain index writers (e.g. HDFS)
   - Filtering the messages written to certain index writers
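
To make the override semantics concrete, here is a minimal Python sketch of
how the effective settings for one writer might be resolved.  It is
illustrative only and not Metron code: the function name, the two DEFAULT_*
constants and the dict layout are assumptions made up for this email.  The
keys ("index", "batchSize", "writerConfig", "when") follow the examples
quoted further down the thread, and the "when" value is carried along as an
opaque Stellar expression string rather than evaluated.

```python
# Hypothetical sketch of "sensor-level defaults with writer-level overrides".
# Nothing here is Metron's actual implementation.
DEFAULT_BATCH_SIZE = 1   # base case from the thread: batchSize of 1
DEFAULT_WHEN = "true"    # base case from the thread: all writers write all messages


def effective_config(sensor, sensor_config, writer):
    """Resolve one writer's settings: writer override > sensor default > base case."""
    resolved = {
        # The index name falls back to the sensor name when unspecified.
        "index": sensor_config.get("index", sensor),
        "batchSize": sensor_config.get("batchSize", DEFAULT_BATCH_SIZE),
        "when": DEFAULT_WHEN,
    }
    # Writer-specific entries, if any, win over the sensor-level values.
    overrides = sensor_config.get("writerConfig", {}).get(writer, {})
    resolved.update(overrides)
    return resolved


config = {
    "index": "foo",
    "batchSize": 1,
    "writerConfig": {
        "elasticsearch": {"batchSize": 100, "when": "exists(field1)"},
        "hdfs": {"when": "false"},
    },
}

es = effective_config("squid", config, "elasticsearch")
hdfs = effective_config("squid", config, "hdfs")
base = effective_config("squid", {}, "hdfs")
```

With an empty config the sketch falls back to the sensor name and a
batchSize of 1 for every writer, which is the behavior we have today.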

The two competing configs between Nick and me are as follows:

   - I want to make sure we keep the old sensor-specific defaults with
   writer-specific overrides available
   - Nick thought we could simplify the permutations by making the indexing
   config only the writer-level configs.

My concern about Nick's suggestion was that the default and majority
case, specifying the index and the batchSize for all writers (the one we
support now), would require more configuration.
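
As a rough illustration of that extra configuration, the Python snippet
below writes the majority case both ways and counts the settings a user has
to type.  Both dicts are hypothetical renderings of the two proposals (in
particular, repeating "index" under every writer is my assumption about how
the writer-only layout would express it), and count_settings is a helper
invented for this comparison.

```python
# Hypothetical comparison of the two layouts; neither is a finalized schema.

# Defaults-plus-overrides layout: one sensor-level pair covers every writer.
with_defaults = {"index": "foo", "batchSize": 100}

# Writer-only layout: the same two settings repeated once per writer.
writer_only = {
    "hdfs": {"index": "foo", "batchSize": 100},
    "elasticsearch": {"index": "foo", "batchSize": 100},
}


def count_settings(cfg):
    """Count the leaf key/value pairs in a (possibly nested) config dict."""
    return sum(
        count_settings(value) if isinstance(value, dict) else 1
        for value in cfg.values()
    )


defaults_count = count_settings(with_defaults)
writer_only_count = count_settings(writer_only)
print(defaults_count, writer_only_count)
```

The gap grows with the number of writers, which is the trade-off against
having fewer sub-structures to explain.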

Nick's concerns about my suggestion were that it was overly complex and
hard to grok and that we could dispense with backwards compatibility and
make people do a bit more work on the default case for the benefits of a
simpler advanced case. (Nick, make sure I don't misstate your position).

Casey


On Fri, Jan 13, 2017 at 10:54 AM, David Lyle <dlyle65535@gmail.com> wrote:

> Casey,
>
> Can you give me a level set of what your thinking is now? I think it's
> global control of all index types + overrides on a per-type basis. Fwiw,
> I'm totally for that, but I want to make sure I'm not imposing my
> preconceived notions on your consensus-driven ones.
>
> -D....
>
> On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella <cestella@gmail.com> wrote:
>
> > I am suggesting that, yes.  The configs are essentially the same as yours,
> > except there is an override specified at the top level.  Without that, in
> > order to specify both HDFS and ES have batch sizes of 100, you have to
> > explicitly configure each.  It's less that I'm trying to have backwards
> > compatibility and more that I'm trying to make the majority case easy: both
> > writers write everything to a specified index name with a specified batch
> > size (which is what we have now).  Beyond that, I want to allow for
> > specifying an override for the config on a writer-by-writer basis for those
> > who need it.
> >
> > On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen <nick@nickallen.org> wrote:
> >
> > > Are you saying we support all of these variants?  I realize you are trying
> > > to have some backwards compatibility, but this also makes it harder for a
> > > user to grok (for me at least).
> > >
> > > Personally I like my original example as there are fewer sub-structures,
> > > like 'writerConfig', which makes the whole thing simpler and easier to
> > > grok.  But maybe others will think your proposal is just as easy to grok.
> > >
> > >
> > >
> > > On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella <cestella@gmail.com> wrote:
> > >
> > > > Ok, so here's what I'm thinking based on the discussion:
> > > >
> > > >    - Keeping the configs that we have now (batchSize and index) as defaults
> > > >    for the unspecified writer-specific case
> > > >    - Adding the config Nick suggested
> > > >
> > > > *Base Case*:
> > > > {
> > > > }
> > > >
> > > >    - all writers write all messages
> > > >    - index named the same as the sensor for all writers
> > > >    - batchSize of 1 for all writers
> > > >
> > > > *Writer-non-specific case*:
> > > > {
> > > >   "index" : "foo"
> > > >  ,"batchSize" : 100
> > > > }
> > > >
> > > >    - All writers write all messages
> > > >    - index is named "foo", different from the sensor for all writers
> > > >    - batchSize is 100 for all writers
> > > >
> > > > *Writer-specific case without filters*
> > > > {
> > > >   "index" : "foo"
> > > >  ,"batchSize" : 1
> > > >  , "writerConfig" :
> > > >    {
> > > >       "elasticsearch" : {
> > > >                                    "batchSize" : 100
> > > >                                  }
> > > >    }
> > > > }
> > > >
> > > >    - All writers write all messages
> > > >    - index is named "foo", different from the sensor for all writers
> > > >    - batchSize is 1 for HDFS and 100 for elasticsearch writers
> > > >    - NOTE: I could override the index name too
> > > >
> > > > *Writer-specific case with filters*
> > > > {
> > > >   "index" : "foo"
> > > >  ,"batchSize" : 1
> > > >  , "writerConfig" :
> > > >    {
> > > >       "elasticsearch" : {
> > > >                                    "batchSize" : 100,
> > > >                                    "when" : "exists(field1)"
> > > >                                  },
> > > >       "hdfs" : {
> > > >                      "when" : "false"
> > > >                   }
> > > >    }
> > > > }
> > > >
> > > >    - ES writer writes messages which have field1, HDFS doesn't
> > > >    - index is named "foo", different from the sensor for all writers
> > > >    - 100 for elasticsearch writers
> > > >
> > > > Thoughts?
> > > >
> > > > On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby <cduby@hortonworks.com> wrote:
> > > >
> > > > > For larger installations you need to control what is indexed so you don’t
> > > > > end up with a nasty elastic search situation and so you can mine the data
> > > > > later for reports and training ml models.
> > > > >
> > > > > Thanks
> > > > > Carolyn
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On 1/13/17, 9:40 AM, "Casey Stella" <cestella@gmail.com> wrote:
> > > > >
> > > > > >OH that's a good idea!
> > > > > >
> > > > > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen <nick@nickallen.org> wrote:
> > > > > >
> > > > > >> I like the "Index Filtering" option based on the flexibility that it
> > > > > >> provides.  Should each output (HDFS, ES, etc) have its own configuration
> > > > > >> settings?  For example, aren't things like batching handled separately for
> > > > > >> HDFS versus Elasticsearch?
> > > > > >>
> > > > > >> Something along the lines of...
> > > > > >>
> > > > > >> {
> > > > > >>   "hdfs" : {
> > > > > >>     "when": "exists(field1)",
> > > > > >>     "batchSize": 100
> > > > > >>   },
> > > > > >>
> > > > > >>   "elasticsearch" : {
> > > > > >>     "when": "true",
> > > > > >>     "batchSize": 1000,
> > > > > >>     "index": "squid"
> > > > > >>   }
> > > > > >> }
> > > > > >>
> > > > > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella <cestella@gmail.com> wrote:
> > > > > >>
> > > > > >> > Yeah, I tend to like the first option too.  Any opposition to that
> > > > > >> > from anyone?
> > > > > >> >
> > > > > >> > The points brought up are good ones and I think that it may be worth
> > > > > >> > a broader discussion of the requirements of indexing in a separate
> > > > > >> > dev list thread.  Maybe a list of desires with coherent use-cases
> > > > > >> > justifying them so we can think about how this stuff should work and
> > > > > >> > where the natural extension points should be.  After all, we need to
> > > > > >> > toe the line between engineering and overengineering for features
> > > > > >> > nobody will want.
> > > > > >> >
> > > > > >> > I'm not sure about the extensions to the standard fields.  I'm torn
> > > > > >> > between the notions that we should have no standard fields vs we
> > > > > >> > should have a boatload of standard fields (with most of them empty).
> > > > > >> > I exchange positions fairly regularly on that question. ;)  It may be
> > > > > >> > worth a dev list discussion to lay out how you imagine an extension
> > > > > >> > of standard fields and how it might look as implemented in Metron.
> > > > > >> >
> > > > > >> > Casey
> > > > > >> >
> > > > > >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <kylerichardson2@gmail.com> wrote:
> > > > > >> >
> > > > > >> > > I'll second my preference for the first option. I think the ability
> > > > > >> > > to use Stellar filters to customize indexing would be a big win.
> > > > > >> > >
> > > > > >> > > I'm glad Matt brought up the point about data lake and CEP. I think
> > > > > >> > > this is a really important use case that we need to consider. Take a
> > > > > >> > > simple example... If I have data coming in from 3 different firewall
> > > > > >> > > vendors and 2 different web proxy/url filtering vendors and I want
> > > > > >> > > to be able to analyze that data set, I need the data to be indexed
> > > > > >> > > all together (likely in HDFS) and to have a normalized schema such
> > > > > >> > > that IP address, URL, and user name (to take a few) can be easily
> > > > > >> > > queried and aggregated. I can also envision scenarios where I would
> > > > > >> > > want to index data based on attributes other than sensor, business
> > > > > >> > > unit or subsidiary for example.
> > > > > >> > >
> > > > > >> > > I've been wanting to propose extending our 7 standard fields to
> > > > > >> > > include things like URL and user. Is there community
> > > > > >> > > interest/support for moving in that direction? If so, I'll start a
> > > > > >> > > new thread.
> > > > > >> > >
> > > > > >> > > Thanks!
> > > > > >> > >
> > > > > >> > > -Kyle
> > > > > >> > >
> > > > > >> > > On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <mattf@apache.org> wrote:
> > > > > >> > >
> > > > > >> > > > Ah, I see.  If overriding the default index name allows using the
> > > > > >> > > > same name for multiple sensors, then the goal can be achieved.
> > > > > >> > > > Thanks,
> > > > > >> > > > --Matt
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > On 1/12/17, 3:30 PM, "Casey Stella" <cestella@gmail.com> wrote:
> > > > > >> > > >
> > > > > >> > > >     Oh, you could!  Let's say you have a syslog parser with data from
> > > > > >> > > >     sources 1, 2 and 3.  You'd end up with one kafka queue with 3
> > > > > >> > > >     parsers attached to that queue, each picking apart the messages
> > > > > >> > > >     from source 1, 2 and 3.  They'd go through separate enrichment and
> > > > > >> > > >     into the indexing topology.  In the indexing topology, you could
> > > > > >> > > >     specify the same index name "syslog" and all of the messages go
> > > > > >> > > >     into the same index for CEP querying if so desired.
> > > > > >> > > >
> > > > > >> > > >     On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <mattf@apache.org> wrote:
> > > > > >> > > >
> > > > > >> > > >     > Syslog is hell on parsers – I know, I worked at LogLogic in a
> > > > > >> > > >     > previous life.  It makes perfect sense to route different lines
> > > > > >> > > >     > from syslog through different appropriate parsers.  But a lot of
> > > > > >> > > >     > what the parsers do is identify consistent subsets of metadata
> > > > > >> > > >     > and annotate it – eg, src_ip_addr, event timestamps, etc.  Once
> > > > > >> > > >     > those metadata are annotated and available with common field
> > > > > >> > > >     > names, why doesn’t it make sense to index the messages together,
> > > > > >> > > >     > for CEP querying?  I think Splunk has illustrated this model.
> > > > > >> > > >     >
> > > > > >> > > >     > On 1/12/17, 3:00 PM, "Casey Stella" <cestella@gmail.com> wrote:
> > > > > >> > > >     >
> > > > > >> > > >     >     yeah, I mean, honestly, I think the approach that we've taken
> > > > > >> > > >     >     for sources which aggregate different types of data is to
> > > > > >> > > >     >     provide filters at the parser level and have multiple parser
> > > > > >> > > >     >     topologies (with different, possibly mutually exclusive
> > > > > >> > > >     >     filters) running.  This would be a completely separate sensor.
> > > > > >> > > >     >     Imagine a syslog data source that aggregates and you want to
> > > > > >> > > >     >     pick apart certain pieces of messages.  This is why the initial
> > > > > >> > > >     >     thought and architecture was one index per sensor.
> > > > > >> > > >     >
> > > > > >> > > >     >     On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley <mattf@apache.org> wrote:
> > > > > >> > > >     >
> > > > > >> > > >     >     > I’m thinking that CEP (Complex Event Processing) is contrary to
> > > > > >> > > >     >     > the idea of silo-ing data per sensor.
> > > > > >> > > >     >     > Now it’s true that some of those sensors are already
> > > > > >> > > >     >     > aggregating data from multiple sources, so maybe I’m wrong here.
> > > > > >> > > >     >     > But it just seems to me that the “data lake” insights come from
> > > > > >> > > >     >     > being able to make decisions over the whole mass of data rather
> > > > > >> > > >     >     > than just vertical slices of it.
> > > > > >> > > >     >     >
> > > > > >> > > >     >     > On 1/12/17, 2:15 PM, "Casey Stella" <cestella@gmail.com> wrote:
> > > > > >> > > >     >     >
> > > > > >> > > >     >     >     Hey Matt,
> > > > > >> > > >     >     >
> > > > > >> > > >     >     >     Thanks for the comment!
> > > > > >> > > >     >     >     1. At the moment, we only have one index name, the default of
> > > > > >> > > >     >     >     which is the sensor name but that's entirely up to the user.
> > > > > >> > > >     >     >     This is sensor specific, so it'd be a separate config for each
> > > > > >> > > >     >     >     sensor.  If we want to build multiple indices per sensor, we'd
> > > > > >> > > >     >     >     have to think carefully about how to do that and it would be a
> > > > > >> > > >     >     >     bigger undertaking.  I guess I can see the use, though
> > > > > >> > > >     >     >     (redirect messages to one index vs another based on a
> > > > > >> > > >     >     >     predicate for a given sensor).  Anyway, not where I was
> > > > > >> > > >     >     >     originally thinking that this discussion would go, but it's an
> > > > > >> > > >     >     >     interesting point.
> > > > > >> > > >     >     >
> > > > > >> > > >     >     >     2. I hadn't thought through the implementation quite yet, but
> > > > > >> > > >     >     >     we don't actually have a splitter bolt in that topology, just
> > > > > >> > > >     >     >     a spout that goes to the elasticsearch writer and also to the
> > > > > >> > > >     >     >     hdfs writer.
> > > > > >> > > >     >     >
> > > > > >> > > >     >     >     On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley <mattf@apache.org> wrote:
> > > > > >> > > >     >     >
> > > > > >> > > >     >     >     > Casey, good to have controls like this.  Couple questions:
> > > > > >> > > >     >     >     >
> > > > > >> > > >     >     >     > 1. Regarding the “index” : “squid” name/value pair, is the
> > > > > >> > > >     >     >     > index name expected to always be a sensor name?  Or is the
> > > > > >> > > >     >     >     > given json structure subordinate to a sensor name in
> > > > > >> > > >     >     >     > zookeeper?  Or can we build arbitrary indexes with this new
> > > > > >> > > >     >     >     > specification, independent of sensor?  Should there actually
> > > > > >> > > >     >     >     > be a list of “indexes”, ie
> > > > > >> > > >     >     >     > { “indexes” : [
> > > > > >> > > >     >     >     >         {“index” : “name1”,
> > > > > >> > > >     >     >     >                 …
> > > > > >> > > >     >     >     >         },
> > > > > >> > > >     >     >     >         {“index” : “name2”,
> > > > > >> > > >     >     >     >                 …
> > > > > >> > > >     >     >     >         } ]
> > > > > >> > > >     >     >     > }
> > > > > >> > > >     >     >     >
> > > > > >> > > >     >     >     > 2. Would the filtering / writer selection logic take place in
> > > > > >> > > >     >     >     > the indexing topology splitter bolt?  Seems like that would
> > > > > >> > > >     >     >     > have the smallest impact on current implementation, no?
> > > > > >> > > >     >     >     >
> > > > > >> > > >     >     >     > Sorry if these are already answered in PR-415, I haven’t had
> > > > > >> > > >     >     >     > time to review that one yet.
> > > > > >> > > >     >     >     > Thanks,
> > > > > >> > > >     >     >     > --Matt
> > > > > >> > > >     >     >     >
> > > > > >> > > >     >     >     >
> > > > > >> > > >     >     >     > On 1/12/17, 12:55 PM, "Michael Miklavcic" <michael.miklavcic@gmail.com> wrote:
> > > > > >> > > >     >     >     >
> > > > > >> > > >     >     >     >     I like the flexibility and expressibility of the first
> > > > > >> > > >     >     >     >     option with Stellar filters.
> > > > > >> > > >     >     >     >
> > > > > >> > > >     >     >     >     M
> > > > > >> > > >     >     >     >
> > > > > >> > > >     >     >     >     On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <cestella@gmail.com> wrote:
> > > > > >> > > >     >     >     >
> > > > > >> > > >     >     >     >     > As of METRON-652 <https://github.com/apache/incubator-metron/pull/415>,
> > > > > >> > > >     >     >     >     > we will have decoupled the indexing configuration from the
> > > > > >> > > >     >     >     >     > enrichment configuration.  As an immediate follow-up to that,
> > > > > >> > > >     >     >     >     > I'd like to provide the ability to turn off and on writers via
> > > > > >> > > >     >     >     >     > the configs.  I'd like to get some community feedback on how
> > > > > >> > > >     >     >     >     > the functionality should work, if y'all are amenable. :)
> > > > > >> > > >     >     >     >     >
> > > > > >> > > >     >     >     >     >
> > > > > >> > > >     >     >     >     > As of now, we have 3 possible writers which can be used in the
> > > > > >> > > >     >     >     >     > indexing topology:
> > > > > >> > > >     >     >     >     >
> > > > > >> > > >     >     >     >     >    - Solr
> > > > > >> > > >     >     >     >     >    - Elasticsearch
> > > > > >> > > >     >     >     >     >    - HDFS
> > > > > >> > > >     >     >     >     >
> > > > > >> > > >     >     >     >     > HDFS is always used, elasticsearch or solr is used depending on
> > > > > >> > > >     >     >     >     > how you start the indexing topology.
> > > > > >> > > >     >     >     >     >
> > > > > >> > > >     >     >     >     > A couple of proposals come to mind immediately:
> > > > > >> > > >     >     >     >     >
> > > > > >> > > >     >     >     >     > *Index Filtering*
> > > > > >> > > >     >     >     >     >
> > > > > >> > > >     >     >     >     > You would be able to specify a filter as defined by a stellar
> > > > > >> > > >     >     >     >     > statement (likely a reuse of the StellarFilter that exists in
> > > > > >> > > >     >     >     >     > the Parsers) which would allow you to indicate on a
> > > > > >> > > >     >     >     >     > message-by-message basis whether or not to write the message.
> > > > > >> > > >     >     >     >     >
> > > > > >> > > >     >     >     >     > The semantics of this would be as follows:
> > > > > >> > > >     >     >     >     >
> > > > > >> > > >     >     >     >     >    - Default (i.e. unspecified) is to pass everything through
> > > > > >> > > >     >     >     >     >    (hence backwards compatible with the current default config).
> > > > > >> > > >     >     >     >     >    - Messages which have the associated stellar statement
> > > > > >> > > >     >     >     >     >    evaluate to true for the writer type will be written,
> > > > > >> > > >     >     >     >     >    otherwise not.
> > > > > >> > > >     >     >     >     >
> > > > > >> > > >     >     >     >     >
> > > > > >> > > >     >     >     >     > Sample indexing config which would write out no messages to
> > > > > >> > > >     >     >     >     > HDFS and write out to Elasticsearch only messages containing a
> > > > > >> > > >     >     >     >     > field called "field1":
> > > > > >> > > >     >     >     >     > {
> > > > > >> > > >     >     >     >     >    "index" : "squid"
> > > > > >> > > >     >     >     >     >   ,"batchSize" : 100
> > > > > >> > > >     >     >     >     >   ,"filters" : {
> > > > > >> > > >     >     >     >     >        "HDFS" : "false"
> > > > > >> > > >     >     >     >     >       ,"ES" : "exists(field1)"
> > > > > >> > > >     >     >     >     >                  }
> > > > > >> > > >     >     >     >     > }
> > > > > >> > > >     >     >     >     >
> > > > > >> > > >     >     >     >     > *Index On/Off Switch*
> > > > > >> > > >     >     >     >     >
> > > > > >> > > >     >     >     >     > A simpler solution would be to just provide a list of writers
> > > > > >> > > >     >     >     >     > to write messages.  The semantics would be as follows:
> > > > > >> > > >     >     >     >     >
> > > > > >> > > >     >     >     >     >    - If the list is unspecified, then the default is to write
> > > > > >> > > >     >     >     >     >    all messages for every writer in the indexing topology
> > > > > >> > > >     >     >     >     >    - If the list is specified, then a writer will write all
> > > > > >> > > >     >     >     >     >    messages if and only if it is named in the list.
> > > > > >> > > >     >     >     >     >
> > > > > >> > > >     >     >     >     > Sample indexing config which turns off HDFS and keeps on
> > > > > >> > > >     >     >     >     > Elasticsearch:
> > > > > >> > > >     >     >     >     > {
> > > > > >> > > >     >     >     >     >    "index" : "squid"
> > > > > >> > > >     >     >     >     >   ,"batchSize" : 100
> > > > > >> > > >     >     >     >     >   ,"writers" : [ "ES" ]
> > > > > >> > > >     >     >     >     > }
> > > > > >> > > >     >     >     >     >
> > > > > >> > > >     >     >     >     > Thanks in advance for the feedback!  Also, if you have any
> > > > > >> > > >     >     >     >     > other, better ideas than the ones presented here, let me know
> > > > > >> > > >     >     >     >     > too.
> > > > > >> > > >     >     >     >     >
> > > > > >> > > >     >     >     >     > Best,
> > > > > >> > > >     >     >     >     >
> > > > > >> > > >     >     >     >     > Casey
> > > > > >> > > >     >     >     >     >
> > > > > >>
> > > > > >> --
> > > > > >> Nick Allen <nick@nickallen.org>
> > > > > >>
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Nick Allen <nick@nickallen.org>
> > >
> >
>
