metron-dev mailing list archives

From Simon Elliston Ball <si...@simonellistonball.com>
Subject Re: [DISCUSS] Moving GeoIP management away from MySQL
Date Mon, 16 Jan 2017 17:00:28 GMT
I like that approach even more. That way we would only have to worry about distributing the
database file in binary format to all the supervisor nodes on update.
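As a rough illustration of the per-node update step, here is a minimal Java sketch that stages the new database file next to the live one and then renames it into place, so a reader should never see a half-written file. The class name GeoDbInstaller, the install method and the file name prefix are made up for the example. This is not existing Metron code, and whether an atomic move replaces an existing target depends on the filesystem.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: installs a freshly downloaded database file on one node.
class GeoDbInstaller {

    /**
     * Copies newDb to a temp file in the target's directory, then renames it
     * over the target. Staging in the same directory keeps the rename on one
     * filesystem, which ATOMIC_MOVE needs.
     */
    static Path install(InputStream newDb, Path target) {
        Path absolute = target.toAbsolutePath();
        try {
            Files.createDirectories(absolute.getParent());
            Path tmp = Files.createTempFile(absolute.getParent(), "geo-db-", ".tmp");
            try {
                Files.copy(newDb, tmp, StandardCopyOption.REPLACE_EXISTING);
                // Assumption: the platform replaces an existing target on an
                // atomic move (true for POSIX rename); otherwise this throws.
                Files.move(tmp, absolute, StandardCopyOption.ATOMIC_MOVE);
            } finally {
                Files.deleteIfExists(tmp); // no-op after a successful move
            }
            return absolute;
        } catch (IOException e) {
            // Unchecked so a caller such as a config callback need not declare it.
            throw new UncheckedIOException(e);
        }
    }
}
```

Each worker could call something like this against a path in its own working directory, which would keep the copies independent of one another.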

It would also make it easier for people to switch to the enterprise DB potentially if they
had the license. 

One slight issue with this might be for people who wanted to extend the database. For example,
organisations may want to add geo-enrichment to their own private network addresses based on
modified versions of the geo database. Currently we don’t really allow this, since we hard-code
ignoring private network classes into the geo enrichment adapter, but I can see a case where
a global org might want to add their own ranges and locations to the data set. Does that make
sense to anyone else?
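To make the idea concrete, here is a minimal sketch in plain Java of a range lookup that an organisation could load with its own private ranges. It uses java.util.TreeMap as a stand-in for whatever sorted structure ends up backing the data. The class name GeoRangeLookup, the sample addresses and the location labels are all hypothetical, it only handles IPv4, and it assumes the ranges do not overlap.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: IPv4 range -> location lookup with user-supplied ranges.
class GeoRangeLookup {

    /** One contiguous block of addresses and the location attached to it. */
    private static final class Range {
        final long end;
        final String location;

        Range(long end, String location) {
            this.end = end;
            this.location = location;
        }
    }

    // Keyed by the first address of each range, so floorEntry() finds the
    // only range that could contain a given address.
    private final TreeMap<Long, Range> ranges = new TreeMap<>();

    /** Converts dotted-quad IPv4 text to its unsigned 32-bit value. */
    static long toLong(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Bad octet in: " + ip);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Registers an inclusive range; assumes it does not overlap another. */
    void addRange(String startIp, String endIp, String location) {
        ranges.put(toLong(startIp), new Range(toLong(endIp), location));
    }

    /** Returns the location for the address, or null if no range covers it. */
    String lookup(String ip) {
        long addr = toLong(ip);
        Map.Entry<Long, Range> candidate = ranges.floorEntry(addr);
        if (candidate == null || addr > candidate.getValue().end) {
            return null;
        }
        return candidate.getValue().location;
    }
}
```

With something like this, a call such as addRange("10.1.0.0", "10.1.255.255", "London office") would let lookup("10.1.42.7") resolve to the office, while addresses outside every registered range return null.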

Simon


> On 16 Jan 2017, at 16:50, JJ Meyer <jjmeyer0@gmail.com> wrote:
> 
> Hello all,
> 
> Can we leverage maxmind's Java client (
> https://github.com/maxmind/GeoIP2-java/tree/master/src/main/java/com/maxmind/geoip2)
> in this case? I believe it can read the MaxMind file directly. Plus, I think
> it has some support for caching as well.
> 
> Thanks,
> JJ
> 
> On Mon, Jan 16, 2017 at 10:32 AM, Simon Elliston Ball <
> simon@simonellistonball.com> wrote:
> 
>> I like the idea of MapDB, since we can essentially pull an instance into
>> each supervisor, so it makes a lot of sense for relatively small scale,
>> relatively static enrichments in general.
>> 
>> Generally this feels like a caching problem, and would be for a simple
>> key-value lookup. In that case I would agree with David Lyle on using HBase
>> as a source of truth and relying on caching.
>> 
>> That said, GeoIP is a different lookup pattern, since it’s a range lookup
>> then a key lookup (or, if we denormalize the MaxMind data, just a range
>> lookup). For that kind of thing, MapDB with something like the BTree seems a
>> good fit.
>> 
>> Simon
>> 
>> 
>>> On 16 Jan 2017, at 16:28, David Lyle <dlyle65535@gmail.com> wrote:
>>> 
>>> I'm +1 on removing the MySQL dependency, BUT - I'd prefer to see it as an
>>> HBase enrichment. If our current caching isn't enough to mitigate the above
>>> issues, we have a problem, don't we? Or do we not recommend HBase
>>> enrichment for per message enrichment in general?
>>> 
>>> Also- can you elaborate on how MapDB would not require a network hop?
>>> Doesn't this mean we would have to sync the enrichment data to each Storm
>>> supervisor? HDFS could (probably would) have a network hop too, no?
>>> 
>>> Fwiw -
>>> "In its place, I've looked at using MapDB, which is a really easy to use
>>> library for creating Java collections backed by a file (This is NOT a
>>> separate installation of anything, it's just a jar that manages interaction
>>> with the file system).  Given the slow churn of the GeoIP files (I believe
>>> they get updated once a week), we can have a script that can be run when
>>> needed, downloads the MaxMind tar file, builds the MapDB file that will be
>>> used by the bolts, and places it into HDFS.  Finally, we update a config to
>>> point to the new file, the bolts get the updated config callback and can
>>> update their db files.  Inside the code, we wrap the MapDB portions to make
>>> it transparent to downstream code."
>>> 
>>> Seems a bit more complex than "refresh the hbase table". Afaik, either
>>> approach would require some sort of translation between GeoIP source format
>>> and target format, so that part is a wash imo.
>>> 
>>> So, I'd really like to see, at least, an attempt to leverage HBase
>>> enrichment.
>>> 
>>> -D...
>>> 
>>> 
>>> On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <cestella@gmail.com> wrote:
>>> 
>>>> I think that it's a sensible thing to use MapDB for the geo enrichment.
>>>> Let me state my reasoning:
>>>> 
>>>>  - An HBase implementation  would necessitate a HBase scan possibly
>>>>  hitting HDFS, which is expensive per-message.
>>>>  - An HBase implementation would necessitate a network hop and MapDB
>>>>  would not.
>>>> 
>>>> I also think this might be the beginning of a more general purpose support
>>>> in Stellar for locally shipped, read-only MapDB lookups, which might be
>>>> interesting.
>>>> 
>>>> In short, all quotes about premature optimization are sure to apply to my
>>>> reasoning, but I can't help but have my spidey senses tingle when we
>>>> introduce a scan-per-message architecture.
>>>> 
>>>> Casey
>>>> 
>>>> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <Dima.Kovalyov@sstech.us>
>>>> wrote:
>>>> 
>>>>> Hello Justin,
>>>>> 
>>>>> Considering that Metron uses HBase tables for storing enrichment and
>>>>> threatintel feeds, can we use HBase for geo enrichment as well?
>>>>> Or can MapDB be used for enrichment and threatintel feeds instead of HBase?
>>>>> 
>>>>> - Dima
>>>>> 
>>>>> On 01/16/2017 04:17 PM, Justin Leet wrote:
>>>>>> Hi all,
>>>>>> 
>>>>>> As a bit of background, right now, GeoIP data is loaded into and managed by
>>>>>> MySQL (the connectors are LGPL licensed and we need to sever our Maven
>>>>>> dependency on it before next release). We currently depend on and install
>>>>>> an instance of MySQL (in each of the Management Pack, Ansible, and Docker
>>>>>> installs). In the topology, we use the JDBCAdapter to connect to MySQL and
>>>>>> query for a given IP.  Additionally, it's a single point of failure for
>>>>>> that particular enrichment right now.  If MySQL is down, geo enrichment
>>>>>> can't occur.
>>>>>> 
>>>>>> I'm proposing that we eliminate the use of MySQL entirely, through all
>>>>>> installation paths (which, unless I missed some, includes Ansible, the
>>>>>> Ambari Management Pack, and Docker).  We'd do this by dropping all the
>>>>>> various MySQL setup and management through the code, along with all the
>>>>>> DDL, etc.  The JDBCAdapter would stay, so that anybody who wants to set up
>>>>>> their own databases for enrichments and install connectors is able to do so.
>>>>>> 
>>>>>> In its place, I've looked at using MapDB, which is a really easy to use
>>>>>> library for creating Java collections backed by a file (This is NOT a
>>>>>> separate installation of anything, it's just a jar that manages interaction
>>>>>> with the file system).  Given the slow churn of the GeoIP files (I believe
>>>>>> they get updated once a week), we can have a script that can be run when
>>>>>> needed, downloads the MaxMind tar file, builds the MapDB file that will be
>>>>>> used by the bolts, and places it into HDFS.  Finally, we update a config to
>>>>>> point to the new file, the bolts get the updated config callback and can
>>>>>> update their db files.  Inside the code, we wrap the MapDB portions to make
>>>>>> it transparent to downstream code.
>>>>>> 
>>>>>> The particularly nice parts about using MapDB are its ease of use and that
>>>>>> it offers, out of the box, the utilities we need to support the operations
>>>>>> we need here (keep in mind the GeoIP files use IP ranges and we need to be
>>>>>> able to easily grab the appropriate range).
>>>>>> 
>>>>>> The main point of concern I have about this is that when we grab the HDFS
>>>>>> file during an update, given that multiple JVMs can be running, we don't
>>>>>> want them to clobber each other. I believe this can be avoided by simply
>>>>>> using each worker's working directory to store the file (and appropriately
>>>>>> ensure threads on the same JVM manage multithreading).  This should keep
>>>>>> the JVMs (and the underlying DB files) entirely independent.
>>>>>> 
>>>>>> This script would get called by the various installations during startup to
>>>>>> do the initial setup.  After install, it can then be called on demand in
>>>>>> order.
>>>>>> 
>>>>>> At this point, we should be all set, with everything running and updatable.
>>>>>> 
>>>>>> Justin
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>> 
>> 

