metron-dev mailing list archives

From Simon Elliston Ball <si...@simonellistonball.com>
Subject Re: [DISCUSS] parser ES + Solr schema abstraction
Date Wed, 23 May 2018 18:00:21 GMT
There is certainly a lot of value in the idea of tagging the data with a config version of
some sort for traceability. This would probably be recorded per topology the data goes through,
which gives us detailed lineage. Maybe something like the NiFi provenance approach and a link
to a lineage store like Atlas would make sense (in our case that’s simpler than the NiFi use
case, of course, since we have set topologies).

My other use for schema versions is around preserving backward compatibility for schema in
stores that need to think harder about schema evolution, such as columnar formats in HDFS (ORC
or Parquet, for example), so I think we need some means of storing and retrieving schema versions.

I’m proposing that the versions be created on the basis of config changes. So the process
would be: a config change triggers schema inference, which triggers a diff against the old
schema, which optionally triggers a net new version.
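
A minimal sketch of that process, assuming a schema is just a map of field names to type names. The `infer_schema` function, the `SchemaStore` class and the `fields` config key are hypothetical names used for illustration only, not existing Metron APIs, and the store is in-memory rather than Atlas or any real repository.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Schema = Dict[str, str]  # field name -> type name


def infer_schema(config: dict) -> Schema:
    """Hypothetical inference: read declared field types from the config."""
    return dict(config.get("fields", {}))


@dataclass
class SchemaStore:
    """In-memory stand-in for wherever schema versions end up being stored."""
    versions: List[Schema] = field(default_factory=list)

    def latest(self) -> Optional[Schema]:
        return self.versions[-1] if self.versions else None

    def on_config_change(self, config: dict) -> int:
        """Infer the schema, diff it against the latest stored version,
        and store a new version only if something changed.

        Returns the 1-based version number in effect after the change.
        """
        inferred = infer_schema(config)
        if inferred != self.latest():
            self.versions.append(inferred)
        return len(self.versions)


store = SchemaStore()
v1 = store.on_config_change({"fields": {"ip_src_addr": "ip"}, "parallelism": 1})
# Config changed, but not in a way that affects the schema: no new version.
v2 = store.on_config_change({"fields": {"ip_src_addr": "ip"}, "parallelism": 4})
# A new field appears: this one does produce a new version.
v3 = store.on_config_change({"fields": {"ip_src_addr": "ip", "bytes": "long"}})
print(v1, v2, v3)  # 1 1 2
```

The point of diffing the inferred schema rather than the config itself is that most config changes would not need to create a version at all.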

Does that make sense?

Simon 

> On 22 May 2018, at 19:33, Otto Fowler <ottobackwards@gmail.com> wrote:
> 
> I’ve also talked with J. Zeolla conceptually about storing data in hdfs relative to the version of the schema that produced it, but that may not matter….
> 
> So Simon, do you mean that as part of taking a configuration change (either at startup or live while running) we ‘update’ the metadata/schema, or re-evaluate and then save/version it?
> Maybe the data should have a field about the config/schema version that it was generated with….
> 
> 
> 
> 
>> On May 22, 2018 at 13:56:23, Simon Elliston Ball (simon@simonellistonball.com) wrote:
>> 
>> Absolutely. I would agree with that as an approach. 
>> 
>> I would also suggest we discuss where schemas and versions should be stored. Atlas? The NiFi schema repo abstraction (which limits us to Avro to express schema)?
>> 
>> What I would like to see would be a change to parser interfaces that emits field types, ditto the enrichment stages, and then detect changes from that.
>> 
>> The other issue to consider is forward and back compatibility on versions. For example, if we want to output ORC schema (I really think we should, because the current JSON on HDFS format is huge and slow), we need to consider the schema output history, since ORC will allow schema evolution to an extent (adding fields) but not to others (removing or reordering fields). This can be resolved by sensible versioning and history-aware schema generation.
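
A rough sketch of what history-aware generation could look like under the constraint described above (appending fields is fine, removing or reordering is not). It assumes a schema is an ordered list of (name, type) pairs; `is_append_only` and `evolve` are hypothetical helpers and do not call any ORC library.

```python
from typing import Dict, List, Tuple

Columns = List[Tuple[str, str]]  # ordered (name, type) pairs


def is_append_only(old: Columns, new: Columns) -> bool:
    """True if `new` keeps every old column in place and only appends."""
    return len(new) >= len(old) and new[:len(old)] == old


def evolve(history: Columns, wanted: Dict[str, str]) -> Columns:
    """Keep all historical columns in their original order, then append
    any wanted fields not seen before. Fields dropped from `wanted` stay
    in the output, and historical types win over changed ones."""
    known = {name for name, _ in history}
    return history + [(n, t) for n, t in wanted.items() if n not in known]


old = [("ip_src_addr", "string"), ("bytes", "bigint")]
grown = old + [("protocol", "string")]
reordered = [("bytes", "bigint"), ("ip_src_addr", "string")]
shrunk = old[:1]

print(is_append_only(old, grown))      # True
print(is_append_only(old, reordered))  # False
print(is_append_only(old, shrunk))     # False

# The current config no longer mentions "bytes" and adds "protocol";
# the generated schema still carries "bytes" in its original position.
generated = evolve(old, {"protocol": "string", "ip_src_addr": "string"})
print(generated == grown)              # True
```

Generating from the stored history rather than from the current config alone is what keeps each new version readable alongside files written under older ones.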
>> 
>> Simon
>> 
>> 
>>> On 22 May 2018 at 15:23, Otto Fowler <ottobackwards@gmail.com> wrote:
>>> Yes Simon, when I say ‘whatever we would call the complete parse/enrich path’ that is what I was referring to.
>>> 
>>> I would think the flow would be:
>>> 
>>> Save or deploy sensor configurations 
>>> -> check if there is a difference in the configurations from last to new version
>>> -> if there is a difference that affects the ‘schema’ in any configuration
>>> -> build master schema from configurations 
>>> -> version, store, deploy
>>> 
>>> or something. I’m sure there are things about clean slate deploy vs. new version deploy.
>>> 
>>>> On May 22, 2018 at 09:59:06, Simon Elliston Ball (simon@simonellistonball.com) wrote:
>>>> 
>>>> What I would really like to see is not a full end-to-end schema, but units
>>>> that contribute schema. I don't want to see a parser, enrichment, indexing
>>>> config as one package because in any given deployment for any given sensor,
>>>> I may have a different set of enrichments, and so need a different output
>>>> template.
>>>> 
>>>> What I would propose would be parsers and enrichments contribute partial
>>>> schema (potentially expressed as avro, but the important thing is just a
>>>> map of fields to types) which can then be composed, and have the metron
>>>> platform handle creating ES templates / solr schema / Hive Hcat schema /
>>>> A.N.Other index's schema meta data as the composite of those pieces. So, a
>>>> parser would contribute a set of fields, the fieldTransformations on the
>>>> sensor would contribute some fields, and each enrichment block would
>>>> contribute some fields, at which point we have enough schema definition to
>>>> generate all the required artefacts for whatever storage it ends up in.
>>>> 
>>>> Essentially, composable partial schema units from each component, which add
>>>> up at the end.
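
As an illustration of the composition idea, here is a small sketch in which each component contributes a map of fields to types and the pieces are merged into one schema. The function names and the example fields are hypothetical, and the type names are simply reused as Elasticsearch field types; a real generator would need a type mapping per target store.

```python
from typing import Dict, Iterable

PartialSchema = Dict[str, str]  # field name -> type name


def compose(parts: Iterable[PartialSchema]) -> PartialSchema:
    """Merge partial schemas, rejecting a field declared with two types."""
    merged: PartialSchema = {}
    for part in parts:
        for name, type_ in part.items():
            if merged.get(name, type_) != type_:
                raise ValueError(
                    f"conflicting types for {name!r}: {merged[name]} vs {type_}"
                )
            merged[name] = type_
    return merged


def to_es_properties(schema: PartialSchema) -> Dict[str, Dict[str, str]]:
    """Render the composite as an Elasticsearch-style `properties` map."""
    return {name: {"type": type_} for name, type_ in schema.items()}


# Hypothetical contributions from one sensor's components.
parser_fields = {"ip_src_addr": "ip", "timestamp": "date"}
transformation_fields = {"full_hostname": "keyword"}
geo_enrichment_fields = {"geo_country": "keyword"}

composite = compose([parser_fields, transformation_fields, geo_enrichment_fields])
properties = to_es_properties(composite)
print(sorted(composite))
# ['full_hostname', 'geo_country', 'ip_src_addr', 'timestamp']
```

Because the composite is only a field-to-type map, the same value could feed a Solr or Hive renderer alongside `to_es_properties`, and a different set of enrichments for the same sensor just means a different list passed to `compose`.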
>>>> 
>>>> Does that make sense?
>>>> 
>>>> Simon
>>>> 
>>>> 
>>>> On 22 May 2018 at 14:10, Otto Fowler <ottobackwards@gmail.com> wrote:
>>>> 
>>>> > We have discussed in the past as part of 777 ( moment of silence…. ) the
>>>> > idea that parsers/sensors ( or whatever we would call the complete
>>>> > parse/enrich path ) could define their ES or Solr schemas so that
>>>> > they can be ‘installed’ as part of metron and remove the requirement for a
>>>> > separate install by the system or by the user of a specific index template
>>>> > or equivalent.
>>>> >
>>>> > Nifi has settled on Avro schemas to describe their ‘record’ based data, and
>>>> > it makes me wonder if we might want to think of using Avro as a universal
>>>> > schema or the base for one such that we can define a schema and apply it to
>>>> > either ES or Solr.
>>>> >
>>>> > Thoughts?
>>>> >
>>>> 
>>>> 
>>>> 
>>>> --
>>>> --
>>>> simon elliston ball
>>>> @sireb
>> 
>> 
>> 
>> --
>> --
>> simon elliston ball
>> @sireb
