metron-dev mailing list archives

From cestella <>
Subject [GitHub] metron pull request #879: METRON-1378: Create a summarizer
Date Mon, 08 Jan 2018 15:08:07 GMT
GitHub user cestella reopened a pull request:

    METRON-1378: Create a summarizer

    ## Contributor Comments
    We have a nice and generalized infrastructure for loading data into HBase and interacting
with it via `` and `ENRICHMENT_GET()`.  It is also useful to summarize a
set of data into a static data structure, store it on HDFS and interact with it via stellar.
 To this end, to complement the ``, we should have a ``
that, using the same extractor config, will process a flat file and output a serialized object.
    The use case for this is as follows:
    Let's say that I have a static list of domains in the second column of a CSV, domains.csv,
and I want to generate a bloom filter containing those domains, sans TLD.
    I should be able to create a file called `bloom.ser` with the serialized bloom filter
given the extractor config:
    {
      "config" : {
        "columns" : {
           "rank" : 0,
           "domain" : 1
        },
        "value_transform" : {
           "domain" : "DOMAIN_REMOVE_TLD(domain)"
        },
        "value_filter" : "LENGTH(domain) > 0",
        "state_init" : "BLOOM_INIT()",
        "state_update" : {
           "state" : "BLOOM_ADD(state, domain)"
        },
        "state_merge" : "BLOOM_MERGE(states)",
        "separator" : ","
      },
      "extractor" : "CSV"
    }
    Note, the associated stellar function `OBJECT_GET` is available in #880.
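One way to read the extractor config above is as a fold over partitions: each worker starts from `state_init`, applies `state_update` to every row that survives `value_transform` and `value_filter`, and the per-worker states are combined by `state_merge`. The following is only a rough Python sketch of that flow, not Metron's implementation. It assumes a plain `set` as a stand-in for the bloom filter, and `remove_tld` and `summarize` are hypothetical helpers; `remove_tld` in particular is a naive approximation of `DOMAIN_REMOVE_TLD` that drops only the last label.

```python
import csv
from functools import reduce


def remove_tld(domain):
    # Naive stand-in for DOMAIN_REMOVE_TLD (assumption): drop the last label only.
    return domain.rsplit(".", 1)[0] if "." in domain else domain


def summarize(lines, num_partitions=2):
    rows = list(csv.reader(lines, delimiter=","))  # "separator" : ","
    # Split the rows round-robin, one partition per worker.
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    states = []
    for part in partitions:
        state = set()  # state_init: a set stands in for BLOOM_INIT()
        for row in part:
            if len(row) < 2:
                continue  # skip blank or short rows
            record = {"rank": row[0], "domain": row[1]}  # "columns"
            record["domain"] = remove_tld(record["domain"])  # value_transform
            if len(record["domain"]) > 0:  # value_filter
                state = state | {record["domain"]}  # state_update
        states.append(state)
    # state_merge: set union stands in for BLOOM_MERGE(states)
    return reduce(set.union, states, set())
```

Calling `summarize(["1,google.com", "2,youtube.com"])` in this sketch yields a set holding `google` and `youtube`; the real summarizer would serialize the merged state to the output file instead.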
    # Testing Plan
    We should run the test plan for #445 to ensure no regressions since 80% of this PR is
just refactoring existing abstractions to reuse.
    ## Write out a String Locally
    We are going to take the top 10k Alexa domains (saved as part of #445's test plan to
    * Keep a running sample of 20 samples per thread
    * At the end, merge the samples and get a random domain from the merged samples
    * Write out the sample
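The steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than Metron's code: it assumes the sampler behaves like a uniform reservoir sampler (Algorithm R), and `Reservoir`, `merge`, and `sample_one` are hypothetical names. The merge simply feeds every retained item into a new reservoir, which is uniform over the original input only when the partitions are of similar size.

```python
import random


class Reservoir:
    """Uniform sample of at most `size` items from a stream (Algorithm R)."""

    def __init__(self, size, rng=None):
        self.size = size
        self.items = []
        self.seen = 0
        self.rng = rng if rng is not None else random.Random()

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.size:
            self.items.append(item)
        else:
            # Replace a retained item with probability size / seen.
            j = self.rng.randrange(self.seen)
            if j < self.size:
                self.items[j] = item
        return self


def merge(reservoirs, target):
    # Feed every retained item from each reservoir into the target reservoir.
    for reservoir in reservoirs:
        for item in reservoir.items:
            target.add(item)
    return target


def sample_one(domains, threads=5, per_thread=20, seed=None):
    rng = random.Random(seed)
    states = []
    for i in range(threads):
        state = Reservoir(per_thread, rng)  # running sample of 20 per thread
        for domain in domains[i::threads]:
            state.add(domain)
        states.append(state)
    merged = merge(states, Reservoir(1, rng))  # merge into a sample of size 1
    return merged.items[0] if merged.items else None
```

`sample_one(domains)` returns one domain from the input, or `None` when the input is empty.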
    ### Test
    * Create a file `~/extractor_sample.json` with the following contents:
    {
      "config" : {
        "columns" : {
           "rank" : 0,
           "domain" : 1
        },
        "value_transform" : {
           "domain" : "DOMAIN_REMOVE_TLD(domain)"
        },
        "value_filter" : "LENGTH(domain) > 0",
        "state_init" : "SAMPLE_INIT(20)",
        "state_update" : {
           "state" : "SAMPLE_ADD(state, domain)"
        },
        "state_merge" : "GET_FIRST(SAMPLE_GET(SAMPLE_MERGE(states, SAMPLE_INIT(1))))",
        "separator" : ","
      },
      "extractor" : "CSV"
    }
    * Summarize via `$METRON_HOME//bin/ -i ~/top-10k.csv -o ~/sample.ser
-e ./extractor_sample.json -p 5 -b 128`
    * Execute `hexdump -C ./sample.ser` and ensure that there is a string in there.  The string
may be preceded or followed by some non-ASCII bytes.
    [root@node1 ~]# hexdump -C ./sample.ser
    00000000  03 01 37 63 66 6d 6e e6                           |..7cfmn.|
    [root@node1 ~]# cat top-10k.csv | grep 7cfmn
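The same check can be scripted instead of reading the hexdump by eye. The sketch below pulls runs of printable ASCII out of a byte string, much like the `strings` utility does; `printable_runs` is a hypothetical helper and the minimum run length of 4 is an arbitrary choice.

```python
import re


def printable_runs(data, min_len=4):
    """Return runs of printable ASCII at least min_len bytes long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]


# Example with the bytes shown in the hexdump above:
# printable_runs(bytes.fromhex("03013763666d6ee6"))
# To check a real file, read it in binary mode first:
# printable_runs(open("sample.ser", "rb").read())
```

Each run it returns can then be searched for in `top-10k.csv`, as the `grep` command above does.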
    ### Typosquatting Use-case Testing
    You can also follow the testing plan for #882 as this code is merged into that PR and
it shows how this feature can be used in a real use-case.
    ## Pull Request Checklist
    Thank you for submitting a contribution to Apache Metron.  
    Please refer to our [Development Guidelines](
for the complete guide to follow for contributions.  
    Please refer also to our [Build Verification Guidelines](
for complete smoke testing guides.  
    In order to streamline the review of the contribution we ask you follow these guidelines
and ask you to double check the following:
    ### For all changes:
    - [x] Is there a JIRA ticket associated with this PR? If not one needs to be created at
[Metron Jira](

    - [x] Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are
trying to resolve? Pay particular attention to the hyphen "-" character.
    - [x] Has your PR been rebased against the latest commit within the target branch (typically
    ### For code changes:
    - [x] Have you included steps to reproduce the behavior or problem that is being changed
or addressed?
    - [x] Have you included steps or a guide to how the change may be verified and tested
    - [x] Have you ensured that the full suite of tests and checks has been executed in the
root metron folder via:
      mvn -q clean integration-test install && build_utils/ 
    - [x] Have you written or updated unit tests and/or integration tests to verify your changes?
    - [x] If adding new dependencies to the code, are these dependencies licensed in a way
that is compatible for inclusion under [ASF 2.0](

    - [x] Have you verified the basic functionality of the build by building and running locally
with Vagrant full-dev environment or the equivalent?
    ### For documentation related changes:
    - [x] Have you ensured that the format looks appropriate for the output in which it is rendered
by building and verifying the site-book? If not, then run the following commands and then verify
the changes via `site-book/target/site/index.html`:
      cd site-book
      mvn site
    #### Note:
    Please ensure that once the PR is submitted, you check travis-ci for build issues and
submit an update to your PR as soon as possible.
    It is also recommended that [travis-ci]( is set up for your personal
repository such that your branches are built there before submitting a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull flatfile_object_gen

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #879
commit 9c492c4540534fa72550aff330ce6c588f640965
Author: cstella <cestella@...>
Date:   2017-12-21T15:17:18Z

    flatfile summarizer initial commit.

commit 15681143e86913a692777770d0a89e1c877e3d99
Author: cstella <cestella@...>
Date:   2017-12-21T18:50:58Z


commit 935d4d2933e7156219722e54cec5dfce228fdbcc
Author: cstella <cestella@...>
Date:   2017-12-21T21:17:23Z

    Updating tests and docs.

commit afe91c341608468e2637db4a02f9428ebe19353a
Author: cstella <cestella@...>
Date:   2017-12-21T21:18:20Z

    more docs.

commit d955e26cf4e7776642e83b23deb305fd5a238cc2
Author: cstella <cestella@...>
Date:   2017-12-21T21:46:30Z

    Renamed test.

commit ac3c612cd6fd7140a14fac9692000f04b65ecc83
Author: cstella <cestella@...>
Date:   2017-12-22T12:23:04Z

    Adding a ToString writer.

commit 34cdb55f6c43049151c5b5242a73a09119de31ef
Author: cstella <cestella@...>
Date:   2017-12-22T15:10:15Z

    Renamed to console writer

commit b3e4408ab98d69866774bae452e9cc47efc4fbdd
Author: cstella <cestella@...>
Date:   2017-12-22T15:14:43Z

    newline issue.

commit 767e4976a723451c92ff7bbceffafd5c38086c19
Author: cstella <cestella@...>
Date:   2017-12-23T15:32:07Z

    Allowing empty outputs

commit b4e40a4e47ddc6ff871ef0e95b433fb4315f8e34
Author: cstella <cestella@...>
Date:   2017-12-23T16:07:10Z

    Missed a compilation error.

commit 3ed05682372b10aa544f7fbba8a93d7dca78ca25
Author: cstella <cestella@...>
Date:   2018-01-08T14:32:34Z

    Merge branch 'master' into flatfile_object_gen


