metron-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (METRON-1795) General Purpose Regex Parser
Date Fri, 07 Dec 2018 00:40:00 GMT

    [ https://issues.apache.org/jira/browse/METRON-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712178#comment-16712178
] 

ASF GitHub Bot commented on METRON-1795:
----------------------------------------

Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r239664781
  
    --- Diff: metron-platform/metron-parsers/README.md ---
    @@ -52,6 +52,62 @@ There are two general types types of parsers:
            This is using the default value for `wrapEntityName` if that property is not set.
         * `wrapEntityName` : Sets the name to use when wrapping JSON using `wrapInEntityArray`.
 The `jsonpQuery` should reference this name.
         * A field called `timestamp` is expected to exist and, if it does not, then current
time is inserted.  
    +  * Regular Expressions Parser
    +      * `recordTypeRegex` : A regular expression to uniquely identify a record type.
    +      * `messageHeaderRegex` : A regular expression used to extract fields from a message
part which is common across all the messages.
    +      * `convertCamelCaseToUnderScore` : If this property is set to true, this parser
will automatically convert all camel-case property names to underscore-separated names.
    +          For example, the following conversions will happen automatically:
    +
    +          ```
    +          ipSrcAddr -> ip_src_addr
    +          ipDstAddr -> ip_dst_addr
    +          ipSrcPort -> ip_src_port
    +          ```
    +          Note this property may be necessary because Java does not support underscores
in named group names. If your property naming convention requires underscores
in property names, use this property.
    +          
    +      * `fields` : A JSON list of maps, each containing a record type to regular expression
mapping.
    +      
    +      A complete configuration example would look like:
    +      
    +      ```json
    +      "convertCamelCaseToUnderScore": true, 
    +      "recordTypeRegex": "kernel|syslog",
    +      "messageHeaderRegex": "(?<syslogPriority>(?<=^<)\\d{1,4}(?=>)).*?(?<timestamp>(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?<syslogHost>(?<=\\s).*?(?=\\s))",
    +      "fields": [
    +        {
    +          "recordType": "kernel",
    +          "regex": ".*(?<eventInfo>(?<=\\]|\\w\\:).*?(?=$))"
    +        },
    +        {
    +          "recordType": "syslog",
    +          "regex": ".*(?<processid>(?<=PID\\s=\\s).*?(?=\\sLine)).*(?<filePath>(?<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?<fileName>.*?(?=\")).*(?<eventInfo>(?<=\").*?(?=$))"
    +        }
    +      ]
    +      ```
    +      **Note**: messageHeaderRegex and regex (within fields) can also be specified as
lists, e.g.
    +      ```json
    +          "messageHeaderRegex": [
    +          "regular expression 1",
    +          "regular expression 2"
    +          ]
    +      ```
    +      Where **regular expression 1** and **regular expression 2** are valid regular
    +      expressions that may contain named groups, which are extracted into fields. The
    +      list is evaluated in order until a matching regular expression is found.
    +      
    +      **recordTypeRegex** can be a more advanced regular expression containing named
groups. For example
    --- End diff --
    
    Thanks. I will update the documentation.
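
The `convertCamelCaseToUnderScore` conversion described in the diff can be illustrated with a minimal Java sketch. This is not the parser's implementation: the class name `CamelCaseSketch` and the method `toUnderscore` are hypothetical, and the rule used here (insert an underscore between a lowercase letter or digit and the uppercase letter that follows, then lower-case the result) is an assumption chosen to reproduce the three examples in the README text.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Metron: illustrates the documented conversion only.
final class CamelCaseSketch {
    // A lowercase letter or digit immediately followed by an uppercase letter.
    private static final Pattern BOUNDARY = Pattern.compile("([a-z0-9])([A-Z])");

    // Inserts "_" at every camel-case boundary, then lower-cases the whole name.
    static String toUnderscore(String name) {
        return BOUNDARY.matcher(name).replaceAll("$1_$2").toLowerCase(Locale.ROOT);
    }
}
```

Under this assumed rule, a named group such as `ipSrcAddr` (legal in a Java regex) would surface as the field `ip_src_addr`, which is not itself a legal Java group name.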


> General Purpose Regex Parser
> ----------------------------
>
>                 Key: METRON-1795
>                 URL: https://issues.apache.org/jira/browse/METRON-1795
>             Project: Metron
>          Issue Type: New Feature
>            Reporter: Jagdeep Singh
>            Priority: Minor
>
> We have implemented a general purpose regex parser for Metron that we are interested
in contributing back to the community.
>  
> While the Metron Grok parser provides some regex based capability today, the intention
of this general purpose regex parser is to:
>  # Allow for more advanced parsing scenarios (specifically, dealing with multiple regex
lines for devices that contain several log formats within them)
>  # Give users and developers of Metron additional options for parsing
>  # With the new parser chaining and regex routing feature available in Metron, this gives
some additional flexibility to logically separate a flow by:
>  # Regex routing to segregate logs at a device level and handle envelope unwrapping
>  # This general purpose regex parser to parse an entire device type that contains multiple
log formats within the single device (for example, RHEL logs)
> At a high level, the control flow is:
>  # Identify the record type of the incoming raw message.
>  # Find and apply the regular expression of corresponding record type to extract the
fields (using named groups). 
>  # Apply the message header regex to extract the fields in the header part of the message (using
named groups).
>  
> The parser config uses the following structure:
>   
> {code:java}
> "recordTypeRegex": "(?<process>(?<=\\s)\\b(kernel|syslog)\\b(?=\\[|:))",
>  "messageHeaderRegex": "(?<syslogpriority>(?<=^<)\\d{1,4}(?=>)).*?(?<timestamp>(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?<syslogHost>(?<=\\s).*?(?=\\s))",
>    "fields": [
>       {
>         "recordType": "kernel",
>         "regex": ".*(?<eventInfo>(?<=\\]|\\w\\:).*?(?=$))"
>       },
>       {
>         "recordType": "syslog",
>         "regex": ".*(?<processid>(?<=PID\\s=\\s).*?(?=\\sLine)).*(?<filePath>(?<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?<fileName>.*?(?=\")).*(?<eventInfo>(?<=\").*?(?=$))"
>       }
> ]
> {code}
>  
> Where:
>  * *recordTypeRegex* is used to distinctly identify a record type. It takes a valid
regular expression and may also have named groups, which would be extracted into fields.
>  * *messageHeaderRegex* is used to specify a regular expression to extract fields from
a message part which is common across all the messages (e.g., syslog fields, standard headers)
>  * *fields*: json list of objects containing recordType and regex. The expression that
is evaluated is based on the output of the recordTypeRegex
>  * Note: *recordTypeRegex* and *messageHeaderRegex* could be specified as lists also
(as a JSON array), where the list will be evaluated in order until a matching regular expression
is found.
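
The control flow and the list semantics described above can be sketched in Java roughly as follows. This is an illustration under stated assumptions, not the contributed parser: the class `RegexFlowSketch` and its methods are hypothetical, the text matched by `recordTypeRegex` is assumed to be the record type, and named groups are discovered by scanning the regex source for `(?<name>` declarations.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the three-step flow; not the Metron implementation.
final class RegexFlowSketch {
    // Matches named-group declarations such as (?<timestamp>...).
    // Lookbehinds, (?<=...) and (?<!...), are skipped because '=' and '!' are not letters.
    private static final Pattern GROUP_NAME =
        Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

    // Tries each regex in order. On the first match, copies its named groups into out
    // and returns the whole matched text. Returns null if no expression matches.
    static String firstMatch(List<String> regexes, String message, Map<String, String> out) {
        for (String regex : regexes) {
            Matcher m = Pattern.compile(regex).matcher(message);
            if (!m.find()) {
                continue; // fall through to the next expression in the list
            }
            Matcher names = GROUP_NAME.matcher(regex);
            while (names.find()) {
                String value = m.group(names.group(1));
                if (value != null) {
                    out.put(names.group(1), value);
                }
            }
            return m.group();
        }
        return null;
    }

    static Map<String, String> parse(String message,
                                     List<String> recordTypeRegex,
                                     Map<String, List<String>> fields,
                                     List<String> messageHeaderRegex) {
        Map<String, String> out = new LinkedHashMap<>();
        // 1. Identify the record type (assumption: the matched text is the type).
        String recordType = firstMatch(recordTypeRegex, message, out);
        // 2. Apply the expression(s) configured for that record type.
        if (recordType != null && fields.containsKey(recordType)) {
            firstMatch(fields.get(recordType), message, out);
        }
        // 3. Apply the header expression(s) common to all messages.
        firstMatch(messageHeaderRegex, message, out);
        return out;
    }
}
```

Modelling every setting as a list keeps the single-expression and multi-expression configurations on one code path: a single regex is simply a list of length one.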



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
