metron-dev mailing list archives

From Matt Foley <ma...@apache.org>
Subject Re: [DISCUSS] System time vs. Event Time
Date Thu, 02 Mar 2017 22:24:42 GMT
Before the thought becomes obsolete, I’d like to say that I agree with Nick about the replay
scenario and threat signature databases.  I think a principal use case is replaying old data
with new threat signatures, to detect problems that were undetectable at the time they happened.
The use case Casey brought up, where you want to reproduce the exact behavior of your system
at an earlier point in time, including using the threat signature database versions that were
installed at that time, would also be useful for debugging, system understanding, and testing,
but I think it is lower priority than the former.

Another high priority use case is replaying data with new Profiler configurations, to answer
questions that we hadn’t thought about asking before.

So, Justin, I think the minimum amount of work for a useful batch process is to:
(a) Make sure event time rather than system time is usable, if not the default, in all components
that record, manipulate, or select based on timestamps.
(b) Enable a chunk of data, defined by our shiny new time window DSL, to be output in chronological
order from sources that store whole messages (HDFS, PCAP, maybe Solr/ES, maybe raw data files
with a time window filter), and routed into a kafka topic, with throttling so kafka doesn’t
try to swallow several TB at once (a rough sketch follows this list).
(c) Which can then be read by a Parser, and the result piped through the whole system, all
the way to threat detection, profiling, and filtered re-recording.
(d) The result set (in HDFS, ES, or Profiler) needs to remain “tagged” somehow with a
batch identifier, both so it doesn’t get mixed up with all the other data from that event
time, and so it can be bulk-deleted if you made a mistake and asked for TBs of the wrong
data.
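
To make (b) concrete, here is a rough, purely illustrative sketch, assuming the messages
have already been pulled out of HDFS and ordered by the time window DSL; the class name,
broker address, and fixed-rate throttle are all made up:

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class WindowedReplay {

  // Push an already time-window-filtered, chronologically ordered chunk of raw
  // messages onto a kafka topic, throttled so kafka isn't handed several TB at once.
  public static void replay(List<String> windowedMessages, String topic,
                            int maxMessagesPerSecond) throws InterruptedException {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");  // made-up broker address
    props.put("key.serializer", StringSerializer.class.getName());
    props.put("value.serializer", StringSerializer.class.getName());

    long pauseMs = 1000L / Math.max(1, maxMessagesPerSecond);  // crude fixed-rate throttle
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      for (String message : windowedMessages) {
        producer.send(new ProducerRecord<>(topic, message));
        Thread.sleep(pauseMs);
      }
    }
  }
}

A real implementation would read straight from HDFS and want a smarter throttle, but the
shape of (b) is roughly this.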

An interesting part of (c) is that we don’t really want the “batch” to interfere with
ongoing real-time processing.  Ideally the mechanism would also deal with data analysts submitting
multiple batch requests at the same time (although admittedly that could be handled with a queue).

Is it sufficient to simply depend on the event timestamp to route stuff appropriately?  That
doesn’t seem to meet (d).  We could effectively “virtualize” the batch job by suffixing
the kafka topic names for the whole data flow related to a batch.  Batch id “foley3256”,
being a bunch of bro messages, could enter the Bro Parser on topic bro_foley3256.  To carry
this through to enrichment, etc., maybe it is sufficient to record the sensorType as “bro_foley3256”,
or maybe it should be sensorType “bro” on kafka topic “enrichment_foley3256”.  Such
schemes could also satisfy (d) above.  Obviously there are a lot of possible variations on
this theme.  What do you think?
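
To make the naming concrete, a hypothetical helper (nothing like this exists in Metron today)
just shows the two options side by side:

public class BatchNaming {

  // Variant 1: fold the batch id into the sensor type,
  // e.g. "bro" + "foley3256" -> "bro_foley3256"
  public static String batchSensorType(String sensorType, String batchId) {
    return sensorType + "_" + batchId;
  }

  // Variant 2: keep sensorType as "bro" and suffix the downstream topics instead,
  // e.g. "enrichment" + "foley3256" -> "enrichment_foley3256"
  public static String batchTopic(String baseTopic, String batchId) {
    return baseTopic + "_" + batchId;
  }
}

Either way the batch id rides along with every message in the flow, which is what (d) needs
for filtering and for bulk deletion.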

--Matt

On 3/2/17, 12:54 PM, "Justin Leet" <justinjleet@gmail.com> wrote:

    I'm just going to throw out a few questions that I don't have good
    answers to.  Casey and Nick, given your familiarity with the systems
    involved, do you have any thoughts?
    
       - What's the smallest unit of work we can do to enable at least a useful
       subset of a fully featured batch process? Looking at it from another
       angle, which of the use cases (either that Nick listed, or that anyone else
       has) gives us the best value for our effort?
       - Can we also do things like limiting support for the interdependencies
       Casey mentioned? If we do approach it that way, how do we avoid setting
       ourselves up for issues parallelizing the more complicated cases?  It
       sounds like we'll need to brainstorm some of the dependency stuff anyway.
       - Are there places right now (like the elasticsearch jira) where we need
       or want to make changes to either fix, improve, or enable some of the
       larger-picture work?
    
    Jon, any other thoughts?  Sounded like you were waiting to see how things
    played out a bit, so if you have any insight, I'd love to hear it.
    
    Justin
    
    On Tue, Feb 28, 2017 at 11:08 AM, Justin Leet <justinjleet@gmail.com> wrote:
    
    > @Jon, it looks like it is based on system date.
    >
    > From ElasticsearchWriter.write:
    > String indexPostfix = dateFormat.format(new Date());  // new Date() == current system time
    > ...
    > indexName = indexName + "_index_" + indexPostfix;
    > ...
    > IndexRequestBuilder indexRequestBuilder =
    >     client.prepareIndex(indexName, sensorType + "_doc");
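    >
    > A hypothetical event-time variant (not what the writer does today) would
    > format the message's own timestamp rather than the wall clock, assuming the
    > parsed JSON carries an epoch-millis "timestamp" field:
    >
    > long eventTimeMillis = Long.parseLong(message.get("timestamp").toString());
    > String indexPostfix = dateFormat.format(new Date(eventTimeMillis));  // event time
    > indexName = indexName + "_index_" + indexPostfix;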
    >
    > Justin
    >
    > On Tue, Feb 28, 2017 at 10:44 AM, Zeolla@GMail.com <zeolla@gmail.com>
    > wrote:
    >
    >> I'm actually a bit surprised to see METRON-691, because I know a while back
    >> I did some experiments to ensure that data was being written to the indexes
    >> that relate to the timestamp in the message, not the current time, and I
    >> thought that messages were getting written to the proper historical indexes,
    >> not the current one.  This was so long ago now, though, that it would
    >> require another look, and I only reviewed it operationally (put a message on
    >> a topic with a certain timestamp, search for it in kibana).
    >>
    >> If that is not the case currently (which I should be able to verify later
    >> this week) then that would be pretty concerning and somewhat separate from
    >> the previous "Metron Batch" style discussions, which are more focused on
    >> data bulk load or historical analysis.
    >>
    >> I will wait to see how the rest of this conversation pans out before
    >> giving
    >> my thoughts on the bigger picture.
    >>
    >> Jon
    >>
    >> On Tue, Feb 28, 2017 at 9:19 AM Casey Stella <cestella@gmail.com> wrote:
    >>
    >> > I think this is a really tricky topic, but necessary.  I've given it a bit
    >> > of thought over the last few months and I don't really see a great way to
    >> > do it given the Profiler.  Here's what I've come up with so far.
    >> >
    >> >
    >> >    - Replaying events will compress events in time (e.g. 2 years of data
    >> >    may come through in 10 minutes)
    >> >    - Replaying events may result in events being out of order temporally
    >> >    even if they are written to kafka in order (just by virtue of hitting a
    >> >    different kafka partition)
    >> >
    >> > Given both of these, in my mind we should handle replaying of data *not*
    >> > within a streaming context, so we can control the order and the grouping of
    >> > the data.  This is essentially the advent of batch Metron.  Off the top of
    >> > my head, though, I'm having trouble thinking of how to parallelize this in
    >> > a clean manner.
    >> >
    >> > Imagine a scenario where telemetry A has an enrichment E1 that depends on
    >> > profile P1, and profile P1 depends on the previous 10 minutes of data.  How,
    >> > in a batch or streaming context, can we ever hope to ensure that the
    >> > profiles for P1 for the last 10 minutes are in place as data flows through,
    >> > across all data points?  Now how about if the values that P1 depends on are
    >> > computed from a profile P2?  Essentially you have a data dependency graph
    >> > between enrichments, profiles, and raw data that you need to work in order.
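    >> >
    >> > To make the ordering problem concrete: a batch pass would essentially have
    >> > to topologically sort that dependency graph and materialize each node
    >> > before its dependents.  A minimal, purely illustrative sketch (hypothetical
    >> > class, not Metron code):
    >> >
    >> > import java.util.*;
    >> >
    >> > public class DependencyOrder {
    >> >   // Kahn's algorithm: returns nodes so every dependency precedes its dependents.
    >> >   public static List<String> topoSort(Map<String, List<String>> deps) {
    >> >     Map<String, Integer> inDegree = new HashMap<>();
    >> >     Map<String, List<String>> dependents = new HashMap<>();
    >> >     for (Map.Entry<String, List<String>> e : deps.entrySet()) {
    >> >       inDegree.putIfAbsent(e.getKey(), 0);
    >> >       for (String dep : e.getValue()) {
    >> >         inDegree.merge(e.getKey(), 1, Integer::sum);
    >> >         dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(e.getKey());
    >> >         inDegree.putIfAbsent(dep, 0);
    >> >       }
    >> >     }
    >> >     Deque<String> ready = new ArrayDeque<>();
    >> >     inDegree.forEach((node, degree) -> { if (degree == 0) ready.add(node); });
    >> >     List<String> order = new ArrayList<>();
    >> >     while (!ready.isEmpty()) {
    >> >       String node = ready.poll();
    >> >       order.add(node);
    >> >       for (String dependent : dependents.getOrDefault(node, Collections.emptyList())) {
    >> >         if (inDegree.merge(dependent, -1, Integer::sum) == 0) ready.add(dependent);
    >> >       }
    >> >     }
    >> >     return order;
    >> >   }
    >> >
    >> >   public static void main(String[] args) {
    >> >     Map<String, List<String>> deps = new HashMap<>();
    >> >     deps.put("E1", Arrays.asList("P1"));   // enrichment E1 needs profile P1
    >> >     deps.put("P1", Arrays.asList("P2"));   // P1 is computed from profile P2
    >> >     deps.put("P2", Arrays.asList("raw"));  // P2 needs the raw telemetry window
    >> >     System.out.println(topoSort(deps));    // [raw, P2, P1, E1]
    >> >   }
    >> > }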
    >> >
    >> >
    >> >
    >> > On Tue, Feb 28, 2017 at 8:03 AM, Justin Leet <justinjleet@gmail.com>
    >> > wrote:
    >> >
    >> > > There are a couple of JIRAs related to the use of system time vs event time.
    >> > >
    >> > > METRON-590 Enable Use of Event Time in Profiler
    >> > > <https://issues.apache.org/jira/browse/METRON-590>
    >> > > METRON-691 Elastic Writer index partitions on system time, not event time
    >> > > <https://issues.apache.org/jira/browse/METRON-691>
    >> > >
    >> > > Is there anything else that needs to be making this distinction, and if
    >> > > so, do we need to be able to support both system time and event time for it?
    >> > >
    >> > > My immediate thought on this is that, once we work on replaying historical
    >> > > data, we'll want system time for geo data passing through.  Given that the
    >> > > geo files can update, we'd want to know which geo file we actually need to
    >> > > be using at the appropriate time.
    >> > >
    >> > > We'll probably also want to double-check anything else that writes out
    >> > > data to a location and provides some sort of timestamping on it.
    >> > >
    >> > > Justin
    >> > >
    >> >
    >> --
    >>
    >> Jon
    >>
    >> Sent from my mobile device
    >>
    >
    >
    


