metron-dev mailing list archives

From: Justin Leet <justinjl...@gmail.com>
Subject: Re: [DISCUSS] System time vs. Event Time
Date: Thu, 02 Mar 2017 20:54:35 GMT
I'm just going to throw out a few questions that I don't have good
answers to.  Casey and Nick, given your familiarity with the systems
involved, do you have any thoughts?

   - What's the smallest unit of work we can do to enable at least a useful
   subset of a fully featured batch process? Looking at it from another
   angle, which of the use cases (either that Nick listed or that anyone else
   has) gives us the best value for our effort?
   - Can we also do things like limiting support for the interdependencies
   Casey mentioned? If we do approach it that way, how do we avoid setting
   ourselves up for issues parallelizing the more complicated cases?  It
   sounds like we'll need to brainstorm some of the dependency stuff anyway.
   - Are there places right now (like the Elasticsearch JIRA) where we need
   or want to make changes to fix, improve, or enable some of the
   larger-picture work?

Jon, any other thoughts?  Sounded like you were waiting to see how things
played out a bit, so if you have any insight, I'd love to hear it.

Justin

On Tue, Feb 28, 2017 at 11:08 AM, Justin Leet <justinjleet@gmail.com> wrote:

> @Jon, it looks like it is based on system date.
>
> From ElasticsearchWriter.write:
> String indexPostfix = dateFormat.format(new Date());
> ...
> indexName = indexName + "_index_" + indexPostfix;
> ...
> IndexRequestBuilder indexRequestBuilder = client.prepareIndex(indexName,
> sensorType + "_doc");
>
> Justin
>
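(For contrast, here is a minimal sketch of what event-time-based index naming
could look like, assuming parsed messages carry an epoch-millis "timestamp"
field; the class and method names are hypothetical, and this is not the actual
ElasticsearchWriter code.)

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Map;

public class EventTimeIndexNaming {

  // Derive the index postfix from the message's event timestamp instead of
  // system time; fall back to system time if the field is missing.
  public static String indexPostfix(Map<String, Object> message,
                                    SimpleDateFormat dateFormat) {
    Object ts = message.get("timestamp");          // assumed epoch millis
    long epochMillis = (ts instanceof Number)
        ? ((Number) ts).longValue()
        : System.currentTimeMillis();
    return dateFormat.format(new Date(epochMillis));
  }

  // Mirrors the naming pattern quoted above: <base>_index_<postfix>.
  public static String indexName(String baseName, Map<String, Object> message,
                                 SimpleDateFormat dateFormat) {
    return baseName + "_index_" + indexPostfix(message, dateFormat);
  }
}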
> On Tue, Feb 28, 2017 at 10:44 AM, Zeolla@GMail.com <zeolla@gmail.com>
> wrote:
>
>> I'm actually a bit surprised to see METRON-691, because I know a while
>> back I did some experiments to confirm that data was being written to the
>> indexes that correspond to the timestamp in the message, not the current
>> time, and I thought that messages were getting written to the proper
>> historical indexes, not the current one.  This was long enough ago, though,
>> that it would require another look, and I only verified it operationally
>> (put a message on a topic with a certain timestamp, then search for it in
>> Kibana).
>>
>> If that is not the case currently (which I should be able to verify later
>> this week), then that would be pretty concerning and somewhat separate from
>> the previous "Metron Batch" style discussions, which are more focused on
>> bulk data loading or historical analysis.
>>
>> I will wait to see how the rest of this conversation pans out before giving
>> my thoughts on the bigger picture.
>>
>> Jon
>>
>> On Tue, Feb 28, 2017 at 9:19 AM Casey Stella <cestella@gmail.com> wrote:
>>
>> > I think this is a really tricky topic, but a necessary one.  I've given it
>> > a bit of thought over the last few months and I don't really see a great
>> > way to do it given the Profiler.  Here's what I've come up with so far in
>> > my thinking.
>> >
>> >
>> >    - Replaying events will compress events in time (e.g. 2 years of data
>> >    may come through in 10 minutes)
>> >    - Replaying events may result in events being out of order temporally,
>> >    even if they are written to Kafka in order (just by virtue of hitting a
>> >    different Kafka partition)
>> >
>> > Given both of these, in my mind we should handle replaying of data *not*
>> > within a streaming context, so that we can control the order and the
>> > grouping of the data.  This is essentially the advent of batch Metron.
>> > Off the top of my head, though, I'm having trouble thinking about how to
>> > parallelize this in a pretty manner.
>> >
>> > Imagine a scenario where telemetry A has an enrichment E1 that depends on
>> > profile P1, and profile P1 depends on the previous 10 minutes of data.
>> > How, in a batch or streaming context, can we ever hope to ensure that the
>> > profiles for P1 for the last 10 minutes are in place as the data flows
>> > through, across all data points?  Now what if the values that P1 depends
>> > on are themselves computed from a profile P2?  Essentially you have a data
>> > dependency graph between enrichments, profiles, and raw data that you need
>> > to work through in order.
>> >
>> >
>> >
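(A minimal sketch of one way to respect the ordering Casey describes in a
batch replay: model the dependencies as a directed graph and process nodes in
topological order, so P2 is computed before P1, and P1 before E1.  The graph
representation and class below are hypothetical, purely for illustration.)

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DependencyOrder {

  // deps maps each node to the set of nodes it depends on.
  public static List<String> topoOrder(Map<String, Set<String>> deps) {
    List<String> order = new ArrayList<>();
    Set<String> visiting = new HashSet<>();
    Set<String> done = new HashSet<>();
    for (String node : deps.keySet()) {
      visit(node, deps, visiting, done, order);
    }
    return order;  // dependencies appear before their dependents
  }

  private static void visit(String node, Map<String, Set<String>> deps,
                            Set<String> visiting, Set<String> done,
                            List<String> order) {
    if (done.contains(node)) {
      return;
    }
    if (!visiting.add(node)) {
      throw new IllegalStateException("Cyclic dependency at " + node);
    }
    for (String dep : deps.getOrDefault(node, Collections.emptySet())) {
      visit(dep, deps, visiting, done, order);
    }
    visiting.remove(node);
    done.add(node);
    order.add(node);
  }

  public static void main(String[] args) {
    Map<String, Set<String>> deps = new HashMap<>();
    deps.put("E1", Collections.singleton("P1"));   // enrichment E1 needs P1
    deps.put("P1", Collections.singleton("P2"));   // profile P1 needs P2
    deps.put("P2", Collections.emptySet());        // P2 needs only raw data
    System.out.println(topoOrder(deps));           // prints [P2, P1, E1]
  }
}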
>> > On Tue, Feb 28, 2017 at 8:03 AM, Justin Leet <justinjleet@gmail.com>
>> > wrote:
>> >
>> > > There are a couple of JIRAs related to the use of system time vs. event
>> > > time.
>> > >
>> > > METRON-590 Enable Use of Event Time in Profiler
>> > > <https://issues.apache.org/jira/browse/METRON-590>
>> > > METRON-691 Elastic Writer index partitions on system time, not event
>> > > time
>> > > <https://issues.apache.org/jira/browse/METRON-691>
>> > >
>> > > Is there anything else that needs to make this distinction, and if so,
>> > > do we need to be able to support both system time and event time for
>> > > it?
>> > >
>> > > My immediate thought on this is that, once we work on replaying
>> > > historical data, we'll want system time for geo data passing through.
>> > > Given that the geo files can update, we'd want to know which geo file we
>> > > actually need to be using at the appropriate time.
>> > >
>> > > We'll probably also want to double check anything else that writes out
>> > > data to a location and provides some sort of timestamping on it.
>> > >
>> > > Justin
>> > >
>> >
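(On the geo point above: a minimal sketch of selecting the geo file snapshot
that was effective at a given timestamp, whichever time basis ends up being
appropriate, assuming we keep snapshots keyed by an effective-from time.  The
class, paths, and layout are hypothetical, not Metron's actual GeoLite
handling.)

import java.util.Map;
import java.util.TreeMap;

public class GeoFileSelector {

  // effective-from epoch millis -> path of the geo file snapshot
  private final TreeMap<Long, String> snapshots = new TreeMap<>();

  public void register(long effectiveFromMillis, String path) {
    snapshots.put(effectiveFromMillis, path);
  }

  // Returns the newest snapshot whose effective-from time is <= the given time.
  public String snapshotFor(long timestampMillis) {
    Map.Entry<Long, String> entry = snapshots.floorEntry(timestampMillis);
    return entry == null ? null : entry.getValue();
  }

  public static void main(String[] args) {
    GeoFileSelector selector = new GeoFileSelector();
    selector.register(1483228800000L, "geo/GeoLite2-City-2017-01.mmdb");  // Jan 2017
    selector.register(1488326400000L, "geo/GeoLite2-City-2017-03.mmdb");  // Mar 2017
    // A timestamp from February 2017 resolves to the January snapshot.
    System.out.println(selector.snapshotFor(1486000000000L));
  }
}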
>> --
>>
>> Jon
>>
>> Sent from my mobile device
>>
>
>
