metron-dev mailing list archives

From justinleet <>
Subject [GitHub] incubator-metron pull request #421: METRON-283 Migrate Geo Enrichment outsid...
Date Fri, 20 Jan 2017 13:29:48 GMT
GitHub user justinleet opened a pull request:

    METRON-283 Migrate Geo Enrichment outside of MySQL

    ## MySQL Removed
    Drops MySQL entirely from the project.  This is done for several reasons outlined in
a discussion thread on the dev list.  They boil down to licensing, eliminating a single
point of failure, using MaxMind's own libraries for handling GeoLite2 data, and a couple
of other concerns.
    The removal covers dependencies, installation paths, READMEs, etc.  The only pieces left
are MySQLConfig and MySQLConfigTest, in case anyone wants to use them.  The vast majority of
removed files and code come simply from stripping MySQL out.  If any other traces of MySQL
are found in review, they should almost certainly be removed.
    This moves to a system based on using MaxMind's binary DB directly. Per their page:
    > The GeoLite2 databases are distributed under the Creative Commons Attribution-ShareAlike
4.0 International License. 
    Our LICENSE file has been updated with the notification that we use their database
(and some of their test data, which is under CCA ShareAlike 3.0).  Both of these licenses
are acceptable for us as stated on Apache Legal.
    Let me know if that notification should be in a different spot or spots, and I can adjust
it appropriately.
    ## GeoLite2 Database
    The main portion of the PR is in ``, which manages access to the
GeoLite2 database.
    ### Raw Database
    The raw database is stored on HDFS, by default under `/apps/metron/geo`.  If no explicit
location is given, `/apps/metron/geo/default/<db_filename>` will be used.  Otherwise, updates
will use `/apps/metron/geo/<millis>/<db_filename>`.  Given the low rate of churn on the DB
(updated once per week) and the potential for replay use cases, I haven't implemented any
pruning or anything fancy on top of this.
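    The path layout above can be sketched as follows. This is only an illustration: `geo_db_path` is a hypothetical helper, not code from this PR, and treating `<millis>` as an epoch-millisecond timestamp supplied by the caller is an assumption.

```python
from typing import Optional

GEO_ROOT = "/apps/metron/geo"  # default HDFS directory named in the description


def geo_db_path(db_filename: str, millis: Optional[int] = None) -> str:
    """Build the HDFS path for a GeoLite2 database file.

    millis=None models the default location; an integer models an update
    stored under its own timestamped directory (assumed epoch milliseconds).
    """
    subdir = "default" if millis is None else str(millis)
    return f"{GEO_ROOT}/{subdir}/{db_filename}"
```

    Keeping each update in its own timestamped directory means older files stay in place, which fits the replay use case mentioned above.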
    #### Updating DB
    A script for updates, ``, is provided in `metron-data-management`.
 Usage details are provided in `metron-data-management/`.  Note that the original
MySQL-based setup didn't appear to have update capabilities.
    The script will pull down a new copy of the GeoLite2 database.  The source can be
MaxMind's standard web address, another hosted location, or even a file:// URL.  Once the db
file is pulled down, the script pushes it to the appropriate HDFS location.  Finally, it pulls
down the global config and updates it with the new location.  This will not require a topology
restart.
    Note that there have been conversations about how we manage config updates (specifically
leaning towards Ambari).  This has not been finalized, and we have two non-Ambari testing
environments (quickdev and docker-metron), so the script just updates ZK directly.  Ambari is
not updated by this script, and it is the user's responsibility to update global.json.
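    The three steps the script performs (fetch, push to HDFS, update the global config) can be sketched roughly as below. This is not the script from this PR: the function names are hypothetical, the config key `geo.hdfs.file` is an assumed name, and writing the updated JSON back to ZK is left out entirely.

```python
import json
import posixpath
import subprocess
import urllib.request

GEO_CONFIG_KEY = "geo.hdfs.file"  # assumed key name, not taken from the PR text


def fetch_db(url: str, local_path: str) -> str:
    """Download the database file; urllib also accepts file:// URLs."""
    urllib.request.urlretrieve(url, local_path)
    return local_path


def push_to_hdfs(local_path: str, hdfs_path: str) -> None:
    """Copy the file into HDFS with the standard `hdfs dfs` commands."""
    parent = posixpath.dirname(hdfs_path)
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", parent], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, hdfs_path], check=True)


def with_geo_location(global_config_json: str, hdfs_path: str) -> str:
    """Return the global config JSON with the database location replaced."""
    config = json.loads(global_config_json)
    config[GEO_CONFIG_KEY] = hdfs_path
    return json.dumps(config, indent=2, sort_keys=True)
```

    Keeping the config rewrite as a pure function would also make a "stage only" flag easy to add: the file could be fetched and pushed without the config step ever running.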
    This leads to a question people may have preferences on:
    - Do we want the script to always update? Should there be a flag to stage the file, but
not update configs?
    ### Code
    The database class is a singleton that allows the database to be updated when the global
config is updated.  It is (hopefully!) correctly locked to avoid threading issues when updating
or reading from the DB (and I've been able to update without issues).
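    As a rough illustration of the pattern described here (a singleton whose data is swapped under a lock when the configured location changes), consider the sketch below. It is not the class from this PR: `GeoDatabaseHolder`, its methods, and the dict standing in for the real database reader are all hypothetical, and a plain mutex stands in for whatever locking the actual code uses.

```python
import threading
from typing import Callable, Dict, Optional


class GeoDatabaseHolder:
    """Process-wide holder whose database can be swapped while readers run."""

    _instance: Optional["GeoDatabaseHolder"] = None
    _instance_lock = threading.Lock()

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._db: Dict[str, dict] = {}  # stand-in for the real database reader
        self._location: Optional[str] = None

    @classmethod
    def instance(cls) -> "GeoDatabaseHolder":
        # Guard creation so two threads cannot build two instances.
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def update(self, location: str, loader: Callable[[str], Dict[str, dict]]) -> bool:
        """Reload only when the configured location changed; report if it did."""
        with self._lock:
            if location == self._location:
                return False
            self._db = loader(location)
            self._location = location
            return True

    def get(self, ip: str) -> Optional[dict]:
        """Look up an address; readers never observe a half-swapped database."""
        with self._lock:
            return self._db.get(ip)
```

    Comparing the location before reloading keeps unrelated global config changes from triggering a database reload.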
    The various Bolts have been updated to make sure they initialize the adapter to have it
grab the current data appropriately.
    In addition, a Stellar function, GEO_GET(), has been provided, which takes an IPv4 address.
 It probably works with an IPv6 address, but I didn't really dig into it, given that the
initial goal was to reach parity.
    Given the somewhat core nature of this, and my relative unfamiliarity going in with how
all these pieces tie together, I'm definitely looking for feedback on how things are implemented,
or if I missed conventions we've used in the code.
    ## Testing
    Unit testing is added for the database and Stellar portions of the code as needed.  The
DB testing uses one of the test DBs that MaxMind has published, because we can't create
the binary format correctly ourselves.  It does not use the full (20+ MB) version of the data,
but rather a stripped-down version (on the order of several KB).
    Three environments were tested during this:
    - quickdev
    - Ambari Management Pack
    - docker-metron
    Having these three disparate environments makes cross-cutting features like this more
complicated to test, so additional scrutiny is merited (I would definitely like at least one
person to run through one of these themselves and make sure it's transparent).  Notably,
quickdev requires the Ansible setup scripts to align; the mpack requires layout, internal
configuration, and handling of additional files and ownership of scripts to work properly; and
docker-metron requires essentially cheating the scripts and just running a wget on the file
because things aren't actually set up.
    ### QuickDev
    Ansible scripts are updated.  Running data through the topologies produced output as expected.
    ### Ambari Management Pack
    RPMs updated where needed. Config Screen layout changed, updates made to properly handle
configs and ownership.  Ran Stellar on this install.  Again, ran data through the topology.
    ### Docker
    Essentially this just involved cheating the scripts and running a wget on the GeoLite2
db file, because there's no Hadoop.  Ran through the instructions to run the topologies (which
are a little different than the others because Docker) and again was able to get data back.
    ## Additional Notes
    - As noted above, do we want the DB script to always update? Should there be a flag to
stage the file, but not update configs?  I primarily see this affecting the mpack because
of the Ambari management behind it.
    - Increased `withMaxTimeMS` in the indexing integration test.  This seems unrelated to
my changes (and I believe it had been seen elsewhere), so if anybody has found the root cause,
I can adjust my code appropriately.
    - LocID doesn't technically exist in the new data, and I suspect it was never meant to
be relied upon anyway outside of being a join key.  The same applies to the new field that
is replacing it in this context.  It seems like we were mostly just passing that field along
because it was available, and it seems like it should be refactored to be more useful.  I
didn't take on that analysis here; this is the slightly more validated version of a gut feeling.
    - The newer form of the MaxMind info has more data available than the old source we were
using.  We should also consider passing (at least some of) this data along. See MaxMind's
"What's New in GeoIP2" page.  One field that leapt out at me as potentially interesting
contains where an IP was registered, rather than just where the IP actually is.  Another is
the `is_anonymous_proxy` key, etc.  I didn't validate whether everything new is in the free
version of the dataset.

You can merge this pull request into a Git repository by running:

    $ git pull geo_mmdb

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #421
commit fe5e6b87e18ed08f6136bb59feb525ba50b978cd
Author: justinjleet <>
Date:   2016-12-22T17:25:32Z

    Drop MySQL and use the GeoLite2 databases instead

commit b907d26f04fa5d8f4af50ecd1b8850aab6241e77
Author: justinjleet <>
Date:   2017-01-18T13:04:37Z

    Updating unit test

commit 1b434ba8942e56f43ac05027c9bdd920ce135672
Author: justinjleet <>
Date:   2017-01-19T04:00:38Z

    Update with MaxMind license

commit dd14dbf397458f26b62ec6638c0a39d06756b15c
Author: justinjleet <>
Date:   2017-01-19T06:57:41Z

    geo url fix

commit 09b33b75a3690269aa88609a96f1ef7d7ea6112f
Author: justinjleet <>
Date:   2017-01-19T07:07:17Z

    fixing docker

commit c9cbb23ebaef0a03811d789af52448662db95cb0
Author: justinjleet <>
Date:   2017-01-19T07:08:39Z

    fixing Ansible after adjusting default path

commit 859106a7fffbbb47c86ea2d55b2a3600cc4b595d
Author: justinjleet <>
Date:   2017-01-19T13:58:45Z

    Fixing metron-docker

commit b6bfc16cf2776272491914f0da55ea19a74c4006
Author: justinjleet <>
Date:   2017-01-19T14:05:47Z

    Update docs

commit 6760c3e95cbefd97098b39ac385edfbf36633dc4
Author: justinjleet <>
Date:   2017-01-19T14:13:47Z

    updating stellar function and readme

commit 46e50bb7b5789811fa0e3545dbd4c3c7b480d079
Author: justinjleet <>
Date:   2017-01-19T14:18:42Z

    Adding a couple unit tests and cleaning up Stellar function results

commit fc856f6b88bdb7191dd0f8fa7d7fd136433ea22e
Author: justinjleet <>
Date:   2017-01-19T14:23:02Z

    Updating Stellar docs

commit 6df268d0e5313435b6c320b7cbd1ec180bf92c92
Author: justinjleet <>
Date:   2017-01-19T20:24:36Z

    Adding note to readme about script interaction with Ambari


If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at or file a JIRA ticket
with INFRA.
