metron-dev mailing list archives

From mmiklavc <...@git.apache.org>
Subject [GitHub] incubator-metron pull request #445: METRON-706: Add Stellar transformations ...
Date Wed, 08 Feb 2017 00:38:35 GMT
GitHub user mmiklavc opened a pull request:

    https://github.com/apache/incubator-metron/pull/445

    METRON-706: Add Stellar transformations and filters to enrichment and threat intel loaders

    This PR completes work in https://issues.apache.org/jira/browse/METRON-706
    
    (Note: this branch includes commits from @cestella that I merged while working on this. They are squashed in master but appear individually here. They show up only in the commit history, not in the diff.)
    
    The motivation for this PR is to expose Stellar in more places. This work enables transformations and filtering in the enrichment and threat intel extractors. The user can now specify transformation expressions on the column values and, separately, filter records based on a provided predicate. The same can be done independently for the key indicator value used as part of the HBase key. In addition, a new configuration property lets the user specify a Zookeeper quorum and reference properties defined in the global config.
    
    See the updated README for documentation details on the new properties.
    
    **Testing**
    
    Testing closely follows the method described in [#432](https://github.com/apache/incubator-metron/pull/432#issuecomment-276733075)
    
    * Download the Alexa top 1m data set
    ```
    wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
    unzip top-1m.csv.zip
    ```
    
    * Stage import file
    ```
    head -n 10000 top-1m.csv > top-10k.csv
    head -n 10 top-1m.csv > top-10.csv
    ```
    
    * Create an extractor.json for the CSV data with the contents below (change zk_quorum if your value differs from the default in the Vagrant quick-dev environment):
    ```
    {
      "config" : {
        "zk_quorum" : "node1:2181",
        "columns" : {
           "rank" : 0,
           "domain" : 1
        },
        "value_transform" : {
           "domain" : "DOMAIN_REMOVE_TLD(domain)",
           "port" : "es.port"
        },
        "value_filter" : "LENGTH(domain) > 0",
        "indicator_column" : "domain",
        "indicator_transform" : {
           "indicator" : "DOMAIN_REMOVE_TLD(indicator)"
        },
        "indicator_filter" : "LENGTH(indicator) > 0",
        "type" : "top_domains",
        "separator" : ","
      },
      "extractor" : "CSV"
    }
    ```
    
    The "port" entry under value_transform references "es.port" from the global config.
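
To make the roles of the four new properties concrete, here is a minimal Python sketch of how one CSV line might pass through the value and indicator steps. It is not the loader's implementation: `remove_tld` is a hypothetical stand-in for `DOMAIN_REMOVE_TLD` with a tiny hard-coded suffix list, the order of the steps is an assumption, and the input line `5,yahoo.com` is an assumed sample chosen to match the `yahoo` row in the expected scan output.

```python
import csv
import io

# Hypothetical stand-in for the Stellar function DOMAIN_REMOVE_TLD.
# It only knows a tiny, illustrative suffix list; the real function differs.
KNOWN_SUFFIXES = ("co.in", "com", "org")


def remove_tld(domain):
    for suffix in KNOWN_SUFFIXES:
        if domain.endswith("." + suffix):
            return domain[: -(len(suffix) + 1)]
    return domain


def extract(line, config, global_config):
    """Return (indicator, value), or None if either filter rejects the record."""
    row = next(csv.reader(io.StringIO(line), delimiter=config["separator"]))
    # "columns": map column names to CSV fields by position.
    value = {name: row[idx] for name, idx in config["columns"].items()}
    # "value_transform": rewrite "domain", add "port" from the global config.
    value["domain"] = remove_tld(value["domain"])
    value["port"] = global_config["es.port"]
    # "value_filter": LENGTH(domain) > 0
    if len(value["domain"]) == 0:
        return None
    # "indicator_transform" and "indicator_filter" act on the key column.
    # Assumption: the indicator starts from the raw column value.
    indicator = remove_tld(row[config["columns"][config["indicator_column"]]])
    if len(indicator) == 0:
        return None
    return indicator, value


config = {
    "columns": {"rank": 0, "domain": 1},
    "indicator_column": "domain",
    "separator": ",",
}
global_config = {"es.port": "9300"}  # value taken from the scan output

print(extract("5,yahoo.com", config, global_config))
```

The returned value carries the same three fields as the stored cell (`port`, `domain`, `rank`), and the indicator is what ends up in the row key alongside the enrichment type.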
    
    * Run the import (parallelism of 5, batch size of 128)
    ```
    echo "truncate 'enrichment'" | hbase shell && /usr/metron/0.3.0/bin/flatfile_loader.sh -i ./top-10k.csv -t enrichment -c t -e ./extractor.json -p 5 -b 128 && echo "count 'enrichment'" | hbase shell
    ```
    
    You should see 9275 records in HBase (fewer than the 10k you might expect).
    
    * Now run it again on the top-10 set.
    ```
    echo "truncate 'enrichment'" | hbase shell && /usr/metron/0.3.0/bin/flatfile_loader.sh -i ./top-10.csv -t enrichment -c t -e ./extractor.json -p 5 -b 128 && echo "count 'enrichment'" | hbase shell
    ```
    
    You should get 9 rows, as shown below:
    ```
    scan 'enrichment'
    ROW                                                                     COLUMN+CELL
     \x09\x00\x0F,\x10\xE5\xD1\xDE_\xBF\x9E\xA7d\xF2\xA8\x94\x00\x0Btop_dom column=t:v, timestamp=1486513090953, value={"port":"9300","domain":"yahoo","rank":"5"}
     ains\x00\x05yahoo
     \x11\xCA\xCF\x01\xB4\xC5\x11@\x0C\xA1A,\xE9j~O\x00\x0Btop_domains\x00\ column=t:v, timestamp=1486513090979, value={"port":"9300","domain":"tmall","rank":"10"}
     x05tmall
     \x13)`\xFC\xF2\xBF\xF9\xC1a\xC8a\xF1h\x0E\xB5\x11\x00\x0Btop_domains\x column=t:v, timestamp=1486513090930, value={"port":"9300","domain":"youtube","rank":"2"}
     00\x07youtube
     1\xC2I\x05k\xEA\x0EY\xE1\xAD\xA0$U\xA9kc\x00\x0Btop_domains\x00\x06goo column=t:v, timestamp=1486513090964, value={"port":"9300","domain":"google","rank":"7"}
     gle
     =\xDD\xDFH\x95\xC0\xB9\xD9\xBAKX\x8B\x9B2T\x9F\x00\x0Btop_domains\x00\ column=t:v, timestamp=1486513090942, value={"port":"9300","domain":"facebook","rank":"3"}
     x08facebook
     D\xDE\x1C\x9A\xCF\x07S\x9A\xDEB\xDB\x87D\x1F\x1D\xF4\x00\x0Btop_domain column=t:v, timestamp=1486513090974, value={"port":"9300","domain":"qq","rank":"9"}
     s\x00\x02qq
     u\xBC\xFC\xC9\x09\x9Af\xE1\xC8\xA5\x9A\x93\xCB0c\x01\x00\x0Btop_domain column=t:v, timestamp=1486513090970, value={"port":"9300","domain":"amazon","rank":"8"}
     s\x00\x06amazon
     \xC7\xA5.l\xC21\xFAQ8\x1E\x5C\x99p\x93_\x9A\x00\x0Btop_domains\x00\x09 column=t:v, timestamp=1486513090958, value={"port":"9300","domain":"wikipedia","rank":"6"}
     wikipedia
     \xCC\xCA\xBF;\x92\xA1\xA0k\xE4\x83i\xBD\xC3\xA8\xE8p\x00\x0Btop_domain column=t:v, timestamp=1486513090948, value={"port":"9300","domain":"baidu","rank":"4"}
     s\x00\x05baidu
    ```
    
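    Once again, we get fewer rows than the input file has records. This is because multiple records map to the same resulting key in HBase once the transformation is applied. In the scan above, no row has rank 1 and the `google` key holds rank 7, which is consistent with two of the top-10 domains reducing to `google` after the TLD is removed.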
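
The effect of key collisions on the record count can be sketched in a few lines of Python. This is only a model under stated assumptions: the table is a plain dict, `remove_tld` is a hypothetical stand-in for `DOMAIN_REMOVE_TLD` with an illustrative suffix list, and the input rows are made up rather than taken from the Alexa file.

```python
# Hypothetical stand-in for DOMAIN_REMOVE_TLD (illustrative suffix list only).
KNOWN_SUFFIXES = ("co.in", "com", "org")


def remove_tld(domain):
    for suffix in KNOWN_SUFFIXES:
        if domain.endswith("." + suffix):
            return domain[: -(len(suffix) + 1)]
    return domain


def load(rows):
    """Model the table as a dict keyed by the transformed indicator."""
    table = {}
    for rank, domain in rows:
        # A later write to the same key replaces the earlier one.
        table[remove_tld(domain)] = {"rank": rank}
    return table


# Made-up sample input, not taken from the Alexa data set.
rows = [("1", "example.com"), ("2", "sample.org"), ("3", "example.co.in")]
table = load(rows)

print(len(rows), "input rows ->", len(table), "stored rows")
print(table["example"])
```

Here three input rows produce two stored rows, and the surviving `example` entry is the one written last. In the real loader, which runs with a parallelism of 5 in the commands above, which record wins may depend on write order.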

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mmiklavc/incubator-metron top-domains-merge

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-metron/pull/445.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #445
    
----
commit 64a2fc6ee1190776bcbb46ecf6841b58ce2bf311
Author: Michael Miklavcic <michael.miklavcic@gmail.com>
Date:   2017-01-25T21:38:08Z

    save some work and notes

commit a6a6ab64e2777610ff57727195d3ce0d2c2c8cb1
Author: Michael Miklavcic <michael.miklavcic@gmail.com>
Date:   2017-01-27T14:25:54Z

    Extraction done

commit 47d814ef95d67738d20ce5dc530ba7b05d418a96
Author: cstella <cestella@gmail.com>
Date:   2017-01-27T23:15:44Z

    Multithreading the SimpleEnrichmentFlatFileLoader

commit 918d4ce4aea5d7dfde992f32bf049c70f35dd182
Author: cstella <cestella@gmail.com>
Date:   2017-01-27T23:23:19Z

    doc changes.

commit c6ca3a86881eb77bc9598a61e3c0cf8280ccb03f
Author: cstella <cestella@gmail.com>
Date:   2017-01-27T23:39:56Z

    Updating docs.

commit 8c9a79cdfa38ea2fbd161095d5e346147558ec5f
Author: cstella <cestella@gmail.com>
Date:   2017-01-28T03:36:31Z

    Investigating integration tests.

commit 315bd181aa634290ab987441d81c28addb7952e2
Author: cstella <cestella@gmail.com>
Date:   2017-01-28T04:09:28Z

    Update integration test to be a proper integration test.

commit 004c6f41b6c1cc3ecea70513e1a468501bd32e3c
Author: cstella <cestella@gmail.com>
Date:   2017-01-28T04:49:37Z

    Adding spliterator unit test for completeness

commit f8dd48ef920c948e1fc5ff736e386f641e551b2b
Author: cstella <cestella@gmail.com>
Date:   2017-01-28T05:01:42Z

    Updating test to use a proper file

commit 9b04f9723d442c8f4fb7a8bcaa1d733fc1305dc4
Author: cstella <cestella@gmail.com>
Date:   2017-01-28T05:17:12Z

    Updating docs and renaming a few things.

commit eb5b82cc35bd767a169f548ea8144dd9ae165f84
Author: cstella <cestella@gmail.com>
Date:   2017-01-28T05:23:25Z

    Update one more test case.

commit 81c42afa2ff619ca23bfa5ec546c94ee8d6063e5
Author: Michael Miklavcic <michael.miklavcic@gmail.com>
Date:   2017-01-30T16:09:52Z

    partial commit - adding additional filter and transform for indicator

commit 310c98bd946b2fdb320193cce85d368f016bf8c3
Author: cstella <cestella@gmail.com>
Date:   2017-01-30T20:36:23Z

    Merge branch 'master' into unified_loader

commit 3f6e3ba4f30e41c94ff25027f1fd7c839ea6c9bf
Author: cstella <cestella@gmail.com>
Date:   2017-01-31T15:39:03Z

    Updating simple enrichment flat file loader to be complete.

commit 2bdaf419621704970159e75e202acfeb868c3571
Author: Michael Miklavcic <michael.miklavcic@gmail.com>
Date:   2017-01-31T20:16:10Z

    Merge branch 'master' into top-domains

commit 79cfdb4fba5e82e9e170bfc77c7133e6646f9787
Author: cstella <cestella@gmail.com>
Date:   2017-01-31T22:12:05Z

    Removing old threatintel_bulk_load.sh script and integrating into the flatfile load script

commit bf7756b52e66907ca23a576ba9be9ab40b33f77d
Author: cstella <cestella@gmail.com>
Date:   2017-01-31T22:22:17Z

    Forgot licenses.

commit e5729a296bdbef6d2d3ee87c69aade396708f47d
Author: Michael Miklavcic <michael.miklavcic@gmail.com>
Date:   2017-02-01T00:16:06Z

    Merge with master. Get indicator transforms and filter working

commit a104f464e6b882121c7ab44079a5570d282c8457
Author: cstella <cestella@gmail.com>
Date:   2017-02-01T00:28:46Z

    updating script.

commit b121e13d892834865847ddd806cbf10da63fa44e
Author: cstella <cestella@gmail.com>
Date:   2017-02-01T00:34:28Z

    Merge branch 'master' into unified_loader

commit b5a9e5a9243576b27d59e959dfab3e99d34eb761
Author: cstella <cestella@gmail.com>
Date:   2017-02-01T00:57:02Z

    Added gzip and zip to regular files

commit 323267ddfb52ab1aa7488e02643a8158044797e2
Author: cstella <cestella@gmail.com>
Date:   2017-02-01T15:04:53Z

    Fixed stupid zip issue.

commit bc26b5b3992b91097bb4fc4b214d4b6bacaddfbb
Author: cstella <cestella@gmail.com>
Date:   2017-02-01T16:27:58Z

    Updating readme and making progress bar optional and better.

commit 6cdf35d94f72be7da524fd5f854876f131ddb9f9
Author: cstella <cestella@gmail.com>
Date:   2017-02-01T17:39:59Z

    updating tests to include gzip and zip

commit fd718bffa5e97f2c5c510b38d6a6d3812aefbed9
Author: Michael Miklavcic <michael.miklavcic@gmail.com>
Date:   2017-02-01T18:57:04Z

    Refactor

commit d24f0c974d27e3861cb431c48efb3380a372e58b
Author: Michael Miklavcic <michael.miklavcic@gmail.com>
Date:   2017-02-02T19:03:56Z

    Get unit test for extractor decorator working

commit d9bb54ec27a0f3282d28ba40d043f0045c167a54
Author: Michael Miklavcic <michael.miklavcic@gmail.com>
Date:   2017-02-02T21:47:08Z

    Add negative test cases. Refactor options as enum in extractor decorator

commit 43c09c810c7d7cfa05cffa4609edab7ba2f24492
Author: Michael Miklavcic <michael.miklavcic@gmail.com>
Date:   2017-02-03T18:10:34Z

    Intermediate commit - need to fetch from PR432

commit eafc786250d9b8e6283bd71c91bbd270ba4d1311
Author: Michael Miklavcic <michael.miklavcic@gmail.com>
Date:   2017-02-03T18:52:03Z

    Get integration tests for flat file loader working with my branch. Fix trampled commit for ExtractorHandler

commit ad1aef760948109565b7144479151312ebccc24d
Author: Michael Miklavcic <michael.miklavcic@gmail.com>
Date:   2017-02-03T19:46:05Z

    Get integration tests working for Stellar transformations in the file loader

----

