metron-dev mailing list archives

From justinleet <...@git.apache.org>
Subject [GitHub] metron pull request #609: METRON-987: Allow stellar enrichments to be specif...
Date Thu, 08 Jun 2017 15:16:35 GMT
Github user justinleet commented on a diff in the pull request:

    https://github.com/apache/metron/pull/609#discussion_r120917140
  
    --- Diff: metron-platform/metron-enrichment/README.md ---
    @@ -71,40 +73,94 @@ The `fieldMap`contents are of interest because they contain the routing and conf
           ]
           }
     ```
    -Based on this sample config, both ip_src_addr and ip_dst_addr will go to the `geo`, `host`, and `hbaseEnrichment` adapter bolts. For the `geo`, `host` and `hbaseEnrichment`, this is sufficient.  However, more complex enrichments may contain their own configuration.  Currently, the `stellar` enrichment requires a more complex configuration, such as:
    +Based on this sample config, both `ip_src_addr` and `ip_dst_addr` will go to the `geo`, `host`, and
    +`hbaseEnrichment` adapter bolts.
    + 
    +#### Stellar Enrichment Configuration
    +For the `geo`, `host` and `hbaseEnrichment`, this is sufficient. However, more complex enrichments
    +may contain their own configuration.  Currently, the `stellar` enrichment is more adaptable and thus
    +requires a more nuanced configuration.
    +
    +At its most basic, we want to take a message and apply a couple of enrichments, such as converting the
    +`hostname` field to lowercase. We do this by specifying the transformation inside of the
    +`config` for the `stellar` fieldMap.  There are two syntaxes that are supported, specifying the transformations
    +as a map with the key as the field and the value the stellar expression:
     ```
         "fieldMap": {
            ...
           "stellar" : {
             "config" : {
    -          "numeric" : {
    -                      "foo": "1 + 1"
    -                      }
    -          ,"ALL_CAPS" : "TO_UPPER(source.type)"
    +          "hostname" : "TO_LOWER(hostname)"
             }
           }
         }
     ```
     
    -Whereas the simpler enrichments just need a set of fields explicitly stated so they can be separated from the message and sent to the enrichment adapter bolt for enrichment and ultimately joined back in the join bolt, the stellar enrichment has its set of required fields implicitly stated through usage.  For instance, if your stellar statement references a field, it should be included and if not, then it should not be included.  We did not want to require users to make explicit the implicit.
    +Another approach is to make the transformations as a list with the same `var := expr` syntax as is used
    +in the Stellar REPL, such as:
    +```
    +    "fieldMap": {
    +       ...
    +      "stellar" : {
    +        "config" : [
    +          "hostname := TO_LOWER(hostname)"
    +        ]
    +      }
    +    }
    +```
    +
    +Sometimes arbitrary stellar enrichments may take enough time that you would prefer to split some of them
    +into groups and execute the groups of stellar enrichments in parallel.  Take, for instance, if you wanted
    +to do an HBase enrichment and a profiler call which were independent of one another.  This usecase is
    +supported by splitting the enrichments up as groups.
     
    -The other way in which the stellar enrichment is somewhat more complex is in how the statements are executed.  In the general purpose case for a list of fields, those fields are used to create a message to send to the enrichment adapter bolt and that bolt's worker will handle the fields one by one in serial for a given message.  For stellar enrichment, we wanted to have a more complex design so that users could specify the groups of stellar statements sent to the same worker in the same message (and thus executed sequentially).  Consider the following configuration:
    +Consider the following example:
     ```
         "fieldMap": {
    +       ...
           "stellar" : {
             "config" : {
    -          "numeric" : {
    -                      "foo": "1 + 1"
    -                      "bar" : TO_LOWER(source.type)"
    -                      }
    -         ,"text" : {
    -                   "ALL_CAPS" : "TO_UPPER(source.type)"
    -                   }
    +          "malicious_domain_enrichment" : {
    +            "is_bad_domain" : "ENRICHMENT_EXISTS('malicious_domains', ip_dst_addr, 'enrichments', 'cf')"
    +          },
    +          "login_profile" : [
    +            "profile_window := PROFILE_WINDOW('from 6 months ago')", 
    +            "global_login_profile := PROFILE_GET('distinct_login_attempts', 'global', profile_window)",
    +            "stats := STATS_MERGE(global_login_profile)",
    +            "auth_attempts_median := STATS_PERCENTILE(stats, 0.5)", 
    +            "auth_attempts_sd := STATS_SD(stats)",
    +            "profile_window := null", 
    +            "global_login_profile := null", 
    +            "stats := null"
    +          ]
             }
           }
         }
     ```
    -We have a group called `numeric` whose stellar statements will be executed sequentially.  In parallel to that, we have the group of stellar statements under the group `text` executing.  The intent here is to allow you to not force higher latency operations to be done sequentially. You can use any name for your groupings you like. Be aware that the configuration is a map and duplicate configuration keys' values are not combined, so the duplicate configuration value will be overwritten.
    +
    +Here we want to perform two enrichments that hit HBase and we would rather not run in sequence.  These
    +enrichments are entirely independent of one another (i.e. neither relies on the output of the other).  In
    +this case, we've created a group called `malicious_domain_enrichment` to inquire about whether the destination
    +address exists in the HBase enrichment table in the `malicious_domains` enrichment type.  This is a simple
    +enrichment, so we can express the enrichment group as a map with the new field `is_bad_domain` being a key
    +and the stellar expression associated with that operation being the associated value.
    +
    +In contrast, the stellar enrichment group `login_profile` is interacting with the profiler, has multiple temporary
    +expressions (i.e. `profile_window`, `global_login_profile`, and `stats`) that are useful only within the context
    +of this group of stellar expressions.  In this case, we would need to ensure that we use the list construct
    +when specifying the group and remember to set the temporary variables to `null` so they are not passed along.
    +
    +In general, things to note from this section are as follows:
    +* The stellar enrichments for the `stellar` enrichment adapter are specified in the `config` for the `stellar` enrichment
    +adapter in the `fieldMap`
    +* Groups of independent (i.e. no expression in any group depend on the output of an expression from an other group) may be executed in parallel
    +* If you have the need to use temporary variables, you may use the list construct.  Ensure that you assign the variables to `null` before the end of the group.
    +* **Ensure that you do not assign a field to a stellar expression which returns an object which JSON cannot represent.**
    --- End diff --
    
    I don't think it's in scope for this, but it feels like this should be enforceable at an early time.  It would be nice to be able to do some high level checks like this before allowing things like this.
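
A minimal sketch of what an early, structural check on the `stellar` `config` might look like, written in Python for brevity rather than in Metron's own Java. Everything here is an assumption for illustration: the function names `parse_statements` and `validate_stellar_config`, the `ASSIGNMENT` pattern, and the choice to file ungrouped statements under the group name `""` are hypothetical and are not part of Metron. The sketch only checks the shape described in the diff above (map form, `var := expr` list form, and named groups of either). It does not evaluate Stellar, so it cannot tell whether an expression returns something JSON cannot represent.

```python
import re

# Hypothetical pattern for list-form statements of the shape `var := expr`.
ASSIGNMENT = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_.]*)\s*:=\s*(\S.*)$")


def parse_statements(block):
    """Return (field, expression) pairs from a map-form or list-form block."""
    if isinstance(block, dict):
        pairs = []
        for field, expr in block.items():
            if not isinstance(expr, str):
                raise ValueError(f"expression for {field!r} must be a string")
            pairs.append((field, expr))
        return pairs
    if isinstance(block, list):
        pairs = []
        for stmt in block:
            match = ASSIGNMENT.match(stmt) if isinstance(stmt, str) else None
            if match is None:
                raise ValueError(f"not a 'var := expr' statement: {stmt!r}")
            pairs.append((match.group(1), match.group(2).strip()))
        return pairs
    raise ValueError("block must be a map or a list")


def validate_stellar_config(config):
    """Return {group: [(field, expr), ...]}, raising ValueError on bad shapes.

    Statements that are not inside a named group are filed under "".
    """
    if isinstance(config, dict):
        groups, ungrouped = {}, {}
        for key, value in config.items():
            if isinstance(value, (dict, list)):
                groups[key] = parse_statements(value)  # a named group
            else:
                ungrouped[key] = value  # a plain field -> expression entry
        if ungrouped or not groups:
            groups[""] = parse_statements(ungrouped)
        return groups
    return {"": parse_statements(config)}
```

Run against the `login_profile` example from the diff, this would return the group's statements in order and reject a list entry such as `"hostname = TO_LOWER(hostname)"` that does not use `:=`. A check for JSON representability would still need the expression's result, or type information that this sketch does not have.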


