metron-dev mailing list archives

From cestella <...@git.apache.org>
Subject [GitHub] metron pull request #879: METRON-1378: Create a summarizer
Date Mon, 08 Jan 2018 19:54:13 GMT
Github user cestella commented on a diff in the pull request:

    https://github.com/apache/metron/pull/879#discussion_r160239982
  
    --- Diff: metron-platform/metron-data-management/README.md ---
    @@ -354,3 +357,91 @@ The parameters for the utility are as follows:
     | -r         | --remote_dir        | No           | HDFS directory to land formatted GeoIP file - defaults to /apps/metron/geo/\<epoch millis\>/ |
     | -t         | --tmp_dir           | No           | Directory for landing the temporary GeoIP data - defaults to /tmp |
     | -z         | --zk_quorum         | Yes          | Zookeeper Quorum URL (zk1:port,zk2:port,...) |
    +
    +### Flatfile Summarizer
    +
    +The shell script `$METRON_HOME/bin/flatfile_summarizer.sh` will read data from local disk, HDFS or URLs and generate a summary object.
    +The object will be serialized and written to either HDFS or local disk, depending on the output mode specified.
    +
    +Note that this utility uses the same extractor config as `flatfile_loader.sh`,
    +but because the output target is a summary object rather than a key-value store, certain configs are not necessary:
    +* `indicator`, `indicator_filter` and `indicator_transform` are not required, but will be executed if present.
    +As in the loader, an indicator field will be available if you specify it (by using `indicator` in the config).
    +* `type` is neither required nor used.
    +
    +In addition, some new configs are expected (a short sketch using them appears after this list):
    +* `state_init` : Executed once to initialize the state object (the object that is written out).
    +* `state_update` : Called once per message.  The fields available are the fields for the row, as well as
    +  * `indicator` - the indicator value, if you have specified it in the config
    +  * `state` - the current state.  Useful for adding to the state (e.g. `BLOOM_ADD(state, val)` where `val` is the name of a field).
    +* `state_merge` : If you are running this multi-threaded and your objects can be merged, this is the statement that will
    +merge the state objects created per thread.  There is a special field available to this config:
    +  * `states` - a list of the state objects
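    +
    +For instance, a minimal sketch of these three configs (assuming a two-column CSV of `rank` and `domain` as in the example further down, and the standard `STATS_*` Stellar functions) that summarizes domain lengths rather than building a bloom filter might look like:
    +```
    +{
    +  "config" : {
    +    "columns" : {
    +      "rank" : 0,
    +      "domain" : 1
    +    },
    +    "state_init" : "STATS_INIT()",
    +    "state_update" : {
    +      "state" : "STATS_ADD(state, LENGTH(domain))"
    +    },
    +    "state_merge" : "STATS_MERGE(states)",
    +    "separator" : ","
    +  },
    +  "extractor" : "CSV"
    +}
    +```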
    +
    +One thing to note here is that there is a configuration
    +parameter in the Extractor config that is only considered by this
    +utility:
    +* `inputFormat` : This specifies how to read the input data.  The two implementations are `BY_LINE` and `WHOLE_FILE`.
    +
    +The default is `BY_LINE`, which makes sense for a list of CSVs, where
    +each line indicates a unit of information to be imported.
    +However, if you are importing a set of STIX documents, then you want
    +each document to be considered as input to the Extractor.
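    +
    +For example, a `WHOLE_FILE` configuration might look like the following.  This is only a sketch: it assumes `inputFormat` sits at the top level of the extractor config (alongside `extractor`) and that a `STIX` extractor is configured as described in the loader documentation above.
    +```
    +{
    +  "config" : { },
    +  "extractor" : "STIX",
    +  "inputFormat" : "WHOLE_FILE"
    +}
    +```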
    +
    +#### Example
    +
    +Suppose you want to generate a bloom filter with all of the domains in a CSV structured similarly to
    +the Alexa top 1M domains, so the columns are:
    +* rank
    +* domain name
    +
    +You want to generate a bloom filter with just the domains, stripping off the TLD.
    +You would execute the following to:
    +* read data from `./top-1m.csv`
    +* write data to `./filter.ser`
    +* use 5 threads
    +
    +```
    +$METRON_HOME/bin/flatfile_summarizer.sh -i ./top-1m.csv -o ./filter.ser -e ./extractor.json -p 5 -b 128
    +```
    +
    +To configure this, `extractor.json` would look like:
    +```
    +{
    +  "config" : {
    +    "columns" : {
    +      "rank" : 0,
    +      "domain" : 1
    +    },
    +    "value_transform" : {
    +      "domain" : "DOMAIN_REMOVE_TLD(domain)"
    +    },
    +    "value_filter" : "LENGTH(domain) > 0",
    +    "state_init" : "BLOOM_INIT()",
    +    "state_update" : {
    +      "state" : "BLOOM_ADD(state, domain)"
    +    },
    +    "state_merge" : "BLOOM_MERGE(states)",
    +    "separator" : ","
    +  },
    +  "extractor" : "CSV"
    +}
    +```
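    +
    +Once generated, the summary object can be pulled back into Stellar and queried.  The following is only a sketch: it assumes the serialized filter has been pushed to a (hypothetical) HDFS location and that the `OBJECT_GET` and `BLOOM_EXISTS` Stellar functions are available in your deployment, e.g. from the Stellar REPL:
    +```
    +domain_filter := OBJECT_GET('/apps/metron/objects/filter.ser')
    +BLOOM_EXISTS(domain_filter, DOMAIN_REMOVE_TLD('example.com'))
    +```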
    +
    +#### Parameters
    +
    +The parameters for the utility are as follows:
    +
    +| Short Code | Long Code           | Is Required? | Description |
    +|------------|---------------------|--------------|-------------|
    +| -h         |                     | No           | Generate the help screen/set of options |
    +| -q         | --quiet             | No           | Do not update progress |
    +| -e         | --extractor_config  | Yes          | JSON Document describing the extractor for this input data source |
    +| -m         | --import_mode       | No           | The Import mode to use: LOCAL, MR.  Default: LOCAL |
    +| -om        | --output_mode       | No           | The Output mode to use: LOCAL, HDFS.  Default: LOCAL |
    +| -i         | --input             | Yes          | The input data location on local disk.  If this is a file, then that file will be loaded.  If this is a directory, then the files will be loaded recursively under that directory. |
    +| -o         | --output            | Yes          | The output data location. |
    --- End diff --
    
    good catch


---
