metron-dev mailing list archives

From Simon Elliston Ball <si...@simonellistonball.com>
Subject Re: [DISCUSS] Adding new fields to stored records
Date Wed, 29 Mar 2017 14:23:47 GMT
Something else worth considering here is the implications for a multi-tenant environment. In a multi-tenant Metron deployment today you can use tools like Ranger to control access to given index storage locations by user. This implies that we would also need to either co-locate the deltas with those locations, or risk doubling the administration load of Ranger policies. If we go for storage of delta logs in HBase, for instance, we will require matching policies for HBase, probably best separating the tables by sensor to match the way HDFS stores data. Right now Metron doesn’t do anything around this, but it’s likely to come up more in the future, so a little planning seems a good idea.

Simon

> On 27 Mar 2017, at 13:12, Simon Elliston Ball <Simon@simonellistonball.com> wrote:
> 
> Many thanks for starting off this discussion. Today in Metron we make a basic assumption that once the data is written it stays written. All our enrichments and modifications happen in the stream before landing in an immutable store, and this is something we need to maintain.
> 
> However, as we start to look at integration use cases, and the idea of providing an interactive UI to investigators using the platform, we need to capture additional data about events:
> - human-entered data (small scale):
>   - has this alert been seen
>   - escalated to a case system
>   - manually combined with other alerts
> - machine-generated data (large scale):
>   - restatement of threat feeds
>   - batch analytics too expensive to fit in the stream
> These require some mutability of the stored data. However, I would argue that we must maintain that all mutability of Metron data is additive. Once data is stated, we should not restate it, in order to maintain the integrity of the record provided by Metron, which is a key value for security departments.
> 
> In the case of the ‘post-indexing’ data we are expecting this to be a smaller profile than the telemetry, since it is mostly human scale. That said, we still have challenges when reading that data. Essentially it provides a delta overlay on the core indexed data which needs to be checked for a significant number of operations, creating in effect a join condition for many queries. The primary query sources are going to be interactive UIs for things like alert status, for which an HBase or search index makes a lot of sense. However, we will also need to be able to access these efficiently in batch for things like relevancy modelling and capturing feedback for human-in-the-loop style models. On that basis, I would argue that something that’s easy to join to the HDFS index in Spark is also essential. HBase would be a candidate here.
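> To make that join concrete, here is a minimal, hypothetical Spark sketch of overlaying the latest deltas onto the HDFS index by guid; the paths and the ‘timestamp’ field on the delta records are illustrative assumptions, not anything Metron provides today:
> 
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.expressions.Window
>   import org.apache.spark.sql.functions._
> 
>   val spark = SparkSession.builder().appName("metron-delta-overlay").getOrCreate()
> 
>   // Immutable telemetry as written by the indexing topology (illustrative path)
>   val telemetry = spark.read.json("/apps/metron/indexing/indexed/snort")
> 
>   // Append-only delta log, one JSON record per mutation, keyed by guid (illustrative path)
>   val deltas = spark.read.json("/apps/metron/augmented/snort")
> 
>   // Keep only the most recent mutation per guid (assumes each delta carries a timestamp)
>   val latest = deltas
>     .withColumn("rn", row_number().over(Window.partitionBy("guid").orderBy(col("timestamp").desc)))
>     .filter(col("rn") === 1)
>     .drop("rn")
> 
>   // A left join keeps every original record untouched and adds the overlay columns where present
>   val overlaid = telemetry.join(latest, Seq("guid"), "left_outer")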
> 
> The format of the stored mutation data also needs to be considered. Since it is likely to involve a relatively small number of modifications, and in keeping with the principle of immutability and preservation of provenance, I would suggest the mutations are stored as a timestamped transaction log against the original message. We may also want a current state representation. It makes sense to me to store the log in HBase, while the current state is updated against the original message in ES / Solr, depending on your search index of choice.
> 
> Looking at the idea of storing the log in HBase, we would have to consider schema. I would recommend keying by message guid, with column-based versioning by timestamp or some sort of vector clock, depending on the expected volume and variance of changes, which I would expect to be low. Alternatively we could look at something like the OpenTSDB schema, with guid and partial timestamp in the key, if we’re expecting high volumes (this seems very unlikely to me).
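> As a rough, hypothetical sketch of that schema using the standard HBase client API (the table name ‘metron_deltas’ and column family ‘d’ are made up for illustration):
> 
>   import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
>   import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
>   import org.apache.hadoop.hbase.util.Bytes
> 
>   val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
>   val table = conn.getTable(TableName.valueOf("metron_deltas"))
> 
>   // Row key is the message guid; each mutated field becomes a column qualifier,
>   // and HBase cell versioning by timestamp gives us the transaction log.
>   val put = new Put(Bytes.toBytes("id2"))
>   put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("newKey"), System.currentTimeMillis(), Bytes.toBytes("newValue"))
>   table.put(put)
> 
>   // Reading back all versions of the columns reconstructs the per-message delta log
>   val result = table.get(new Get(Bytes.toBytes("id2")).setMaxVersions())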
> 
> Another option, similar to Raghu’s sidecar files, is to borrow the architecture of Hive updates, which is to write sidecar delta files that every query checks against the underlying file for modifications, and to periodically compact. This would make sense, were it not for our need for immutability. Compaction could, however, be done in batch against the original record file, and would only add fields in log form to it. We can get away with this optimisation over the Hive method, since we are never looking to change original values, only ‘after the index’ values. That said, compaction is still likely to be heavy and full of potential problems with things like stripe and block alignment for performance (maybe there is something we can learn here from the early problems with Hive ACID if we go down that route). Personally I see this as a high-risk option.
> 
> Something I would like to consider is how we abstract this from the Metron UI and other Metron users. I would recommend we deliver a data services layer API covering access to all the underlying data, and controlling the immutability and maintenance of whatever persistence we use. I would also like to see a Spark relation built for Metron to abstract data access on the backend of Spark jobs, which would allow us to decouple things like model building from the underlying mechanisms and file formats.
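> Purely as a strawman for what such a layer might expose (all names here are made up for illustration), the key point being that the API only ever appends and never offers an update-in-place:
> 
>   // Hypothetical data services interface: additive mutations only; reads resolve
>   // the original record plus its delta log into a current view.
>   case class Mutation(guid: String, timestamp: Long, fields: Map[String, String])
> 
>   trait MetronDataService {
>     def appendMutation(sensor: String, mutation: Mutation): Unit          // write to the delta log
>     def currentView(sensor: String, guid: String): Map[String, String]    // original message + overlaid deltas
>     def mutationLog(sensor: String, guid: String): Seq[Mutation]          // full provenance, oldest first
>   }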
> 
> The short version is that I would say we store a transaction log in HBase and consider mutating the document in search.
> 
> Simon
> 
> 
>> On 27 Mar 2017, at 10:26, Raghu Mitra Kandikonda <rksv@hortonworks.com> wrote:
>> 
>> Hi All,
>> 
>> I would like to start a discussion around what would be a good approach for appending data to the existing records that are processed by Metron. Here are a few thoughts to start with.
>> 
>> 1. Store the new fields just in ES and allow records to be different in ES and HDFS.
>> 2. Store the new fields in HBase along with ES.
>>    a. We can create a new table in HBase that stores guid + key (or any other unique key of the record) and the new value.
>>    b. The table name will be the same as the file name that originally contained the record.
>> 3. Store the new fields in ES and in HDFS.
>>    a. The new fields will be stored in the same file as the original record.
>>    b. The new fields are stored along with the guid of the record.
>>    c. Any changes to the values of the fields will result in a new record instead of modifying the existing record.
>>    d. To read the latest value for a record we need to parse the entire file.
>> Ex: File enrichment-null-0-0-1490335748664.json has 3 records:
>> {"key1": "value1", "key2": "value2", "key3": "value3", "guid": "id1"}
>> {"key1": "value11", "key2": "value21", "key3": "value31", "guid": "id2"}
>> {"key1": "value12", "key2": "value22", "key3": "value32", "guid": "id3"}
>> Now we have to store a new field for the record with guid id2; the new file looks as follows:
>> {"key1": "value1", "key2": "value2", "key3": "value3", "guid": "id1"}
>> {"key1": "value11", "key2": "value21", "key3": "value31", "guid": "id2"}
>> {"key1": "value12", "key2": "value22", "key3": "value32", "guid": "id3"}
>> {"guid": "id2", "newKey": "newValue"}
>> Later, when the value of newKey for that record is changed to newestValue, the new file looks as follows:
>> {"key1": "value1", "key2": "value2", "key3": "value3", "guid": "id1"}
>> {"key1": "value11", "key2": "value21", "key3": "value31", "guid": "id2"}
>> {"key1": "value12", "key2": "value22", "key3": "value32", "guid": "id3"}
>> {"guid": "id2", "newKey": "newValue"}
>> {"guid": "id2", "newKey": "newestValue"}
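>> (As a minimal, hypothetical sketch of what 3d means at read time: scan the whole file and let later records for a guid overwrite earlier ones. The local-file read and Jackson usage below are purely illustrative.)
>> 
>>   import com.fasterxml.jackson.databind.ObjectMapper
>>   import scala.collection.JavaConverters._
>>   import scala.io.Source
>> 
>>   val mapper = new ObjectMapper()
>> 
>>   // Fold over every line; because mutations are only appended, the last occurrence
>>   // of a field for a given guid is its current value.
>>   val latestByGuid = Source.fromFile("enrichment-null-0-0-1490335748664.json").getLines()
>>     .map(line => mapper.readValue(line, classOf[java.util.Map[String, Object]]).asScala.toMap)
>>     .foldLeft(Map.empty[String, Map[String, Object]]) { (acc, rec) =>
>>       val guid = rec("guid").toString
>>       acc.updated(guid, acc.getOrElse(guid, Map.empty[String, Object]) ++ rec)
>>     }
>> 
>>   // latestByGuid("id2") now holds key1..key3 plus newKey -> "newestValue"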
>> 4. Store the new fields in ES and in HDFS.
>>    a. The new fields will be stored in a new file, separate from the file where the record originally existed.
>>    b. The name of the file will be the same as the file where the record is originally present, but it will be in a different folder.
>>    c. The new fields are stored along with the guid of the record.
>>    d. A new value for an existing field, or a new field, is appended to the end of the file instead of modifying a record.
>>    e. To read the latest value for a record we need to parse the entire file.
>> Ex: File /apps/metron/indexing/indexed/snort/enrichment-null-0-0-1490335746765.json has the following records:
>> {"key1": "value1", "key2": "value2", "key3": "value3", "guid": "id1"}
>> {"key1": "value11", "key2": "value21", "key3": "value31", "guid": "id2"}
>> {"key1": "value12", "key2": "value22", "key3": "value32", "guid": "id3"}
>> Now we have a 'newKey' and 'newValue' to be stored for the record with guid id2. The file enrichment-null-0-0-1490335746765.json will look the same, but we will have a new file /apps/metron/augmented/snort/enrichment-null-0-0-1490335746765.json with the following content:
>> {"guid": "id2", "newKey": "newValue"}
>> Later, the value of newKey is changed to newestValue and there is a new key called newestKey; the file now looks as follows:
>> {"guid": "id2", "newKey": "newValue"}
>> {"guid": "id2", "newKey": "newestValue"}
>> {"guid": "id2", "newestKey": "nextNewestValue"}
>> 
>> -Raghu
>> 
>> 
>> 
> 

