metron-dev mailing list archives

From Simon Elliston Ball <si...@simonellistonball.com>
Subject Re: [DISCUSS] Moving GeoIP management away from MySQL
Date Mon, 16 Jan 2017 16:32:12 GMT
I like the idea of MapDB, since we can essentially pull an instance into each supervisor;
it makes a lot of sense for relatively small-scale, relatively static enrichments in general.


Generally this feels like a caching problem, and it would be one for a simple key-value lookup.
In that case I would agree with David Lyle on using HBase as the source of truth and relying on
caching.
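
The cache-in-front-of-a-source-of-truth pattern mentioned above could be sketched roughly as follows. This is an illustrative sketch, not Metron's actual enrichment API: the class and loader function here are hypothetical, with java.util.LinkedHashMap standing in as a small LRU cache over a slow key-value store such as HBase.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: an LRU cache in front of a slow source of truth.
// The loader function stands in for the remote (e.g. HBase) lookup.
public class CachedEnrichment {
    private final Map<String, String> cache;
    private final Function<String, String> loader;

    public CachedEnrichment(int maxEntries, Function<String, String> loader) {
        this.loader = loader;
        // Access-ordered LinkedHashMap that evicts the eldest entry: a minimal LRU.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public String enrich(String key) {
        // Only falls through to the source of truth on a cache miss.
        return cache.computeIfAbsent(key, loader);
    }
}
```

With a warm cache, repeated per-message lookups never touch the backing store, which is the mitigation David is relying on.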

That said, GeoIP is a different lookup pattern, since it’s a range lookup followed by a key
lookup (or, if we denormalize the MaxMind data, just a range lookup). For that kind of thing,
MapDB with something like the BTree seems a good fit.
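
To make the range-lookup point concrete: MapDB's BTreeMap implements NavigableMap, so a GeoIP range lookup reduces to a single floor lookup on the range start. The sketch below uses java.util.TreeMap as a stand-in (so it needs no extra jar); the names and the long-encoded IPs are illustrative, not Metron code.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of a GeoIP-style range lookup. Keys are the start of each IP
// range (IPs encoded as longs); floorEntry finds the greatest range start
// <= the query IP in a single O(log n) step. TreeMap stands in for
// MapDB's BTreeMap here; both implement NavigableMap.
public class GeoRanges {
    private final TreeMap<Long, String> ranges = new TreeMap<>();

    public void addRange(long start, String location) {
        ranges.put(start, location);
    }

    public String lookup(long ip) {
        Map.Entry<Long, String> e = ranges.floorEntry(ip);
        return e == null ? null : e.getValue();
    }
}
```

A plain key-value store like HBase has no cheap equivalent of floorEntry, which is why the range pattern pushes toward an ordered structure.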

Simon


> On 16 Jan 2017, at 16:28, David Lyle <dlyle65535@gmail.com> wrote:
> 
> I'm +1 on removing the MySQL dependency, BUT - I'd prefer to see it as an
> HBase enrichment. If our current caching isn't enough to mitigate the above
> issues, we have a problem, don't we? Or do we not recommend HBase
> enrichment for per message enrichment in general?
> 
> Also- can you elaborate on how MapDB would not require a network hop?
> Doesn't this mean we would have to sync the enrichment data to each Storm
> supervisor? HDFS could (probably would) have a network hop too, no?
> 
> Fwiw -
> "In its place, I've looked at using MapDB, which is a really easy to use
> library for creating Java collections backed by a file (This is NOT a
> separate installation of anything, it's just a jar that manages interaction
> with the file system).  Given the slow churn of the GeoIP files (I believe
> they get updated once a week), we can have a script that can be run when
> needed, downloads the MaxMind tar file, builds the MapDB file that will be
> used by the bolts, and places it into HDFS.  Finally, we update a config to
> point to the new file, the bolts get the updated config callback and can
> update their db files.  Inside the code, we wrap the MapDB portions to make
> it transparent to downstream code."
> 
> Seems a bit more complex than "refresh the hbase table". Afaik, either
> approach would require some sort of translation between GeoIP source format
> and target format, so that part is a wash imo.
> 
> So, I'd really like to see, at least, an attempt to leverage HBase
> enrichment.
> 
> -D...
> 
> 
> On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <cestella@gmail.com> wrote:
> 
>> I think that it's a sensible thing to use MapDB for the geo enrichment.
>> Let me state my reasoning:
>> 
>>   - An HBase implementation would necessitate an HBase scan, possibly
>>   hitting HDFS, which is expensive per-message.
>>   - An HBase implementation would necessitate a network hop and MapDB
>>   would not.
>> 
>> I also think this might be the beginning of a more general purpose support
>> in Stellar for locally shipped, read-only MapDB lookups, which might be
>> interesting.
>> 
>> In short, all quotes about premature optimization are sure to apply to my
>> reasoning, but I can't help but have my spidey senses tingle when we
>> introduce a scan-per-message architecture.
>> 
>> Casey
>> 
>> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <Dima.Kovalyov@sstech.us>
>> wrote:
>> 
>>> Hello Justin,
>>> 
>>> Considering that Metron uses hbase tables for storing enrichment and
>>> threatintel feeds, can we use Hbase for geo enrichment as well?
>>> Or can MapDB be used for enrichment and threatintel feeds instead of
>>> hbase?
>>> 
>>> - Dima
>>> 
>>> On 01/16/2017 04:17 PM, Justin Leet wrote:
>>>> Hi all,
>>>> 
>>>> As a bit of background: right now, GeoIP data is loaded into and managed by
>>>> MySQL (the connectors are LGPL licensed and we need to sever our Maven
>>>> dependency on it before the next release).  We currently depend on and
>>>> install an instance of MySQL (in each of the Management Pack, Ansible, and
>>>> Docker installs).  In the topology, we use the JDBCAdapter to connect to
>>>> MySQL and query for a given IP.  Additionally, it's a single point of
>>>> failure for that particular enrichment right now.  If MySQL is down, geo
>>>> enrichment can't occur.
>>>> 
>>>> I'm proposing that we eliminate the use of MySQL entirely, through all
>>>> installation paths (which, unless I missed some, includes Ansible, the
>>>> Ambari Management Pack, and Docker).  We'd do this by dropping all the
>>>> various MySQL setup and management through the code, along with all the
>>>> DDL, etc.  The JDBCAdapter would stay, so that anybody who wants to set up
>>>> their own databases for enrichments and install connectors is able to do so.
>>>> 
>>>> In its place, I've looked at using MapDB, which is a really easy to use
>>>> library for creating Java collections backed by a file (this is NOT a
>>>> separate installation of anything; it's just a jar that manages interaction
>>>> with the file system).  Given the slow churn of the GeoIP files (I believe
>>>> they get updated once a week), we can have a script that can be run when
>>>> needed: it downloads the MaxMind tar file, builds the MapDB file that will
>>>> be used by the bolts, and places it into HDFS.  Finally, we update a config
>>>> to point to the new file; the bolts get the updated config callback and can
>>>> update their db files.  Inside the code, we wrap the MapDB portions to make
>>>> it transparent to downstream code.
>>>> 
>>>> The particularly nice parts about using MapDB are its ease of use, plus the
>>>> fact that it offers, out of the box, the utilities we need to support the
>>>> operations required here (keep in mind the GeoIP files use IP ranges, and
>>>> we need to be able to easily grab the appropriate range).
>>>> 
>>>> The main point of concern I have is that when we grab the HDFS file during
>>>> an update, given that multiple JVMs can be running, we don't want them to
>>>> clobber each other.  I believe this can be avoided by simply using each
>>>> worker's working directory to store the file (and appropriately ensuring
>>>> that threads on the same JVM manage multithreading).  This should keep the
>>>> JVMs (and the underlying DB files) entirely independent.
>>>> 
>>>> This script would get called by the various installations during startup to
>>>> do the initial setup.  After install, it can then be called on demand as
>>>> needed.
>>>> 
>>>> Justin
>>>> 
>>> 
>>> 
>> 
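
The update scheme discussed in the thread, each worker keeping its own copy of the DB file so JVMs stay independent, while threads within one JVM coordinate around refreshes, could be sketched roughly like this. Everything here is hypothetical (names, the in-memory map standing in for the MapDB-backed file); the point is the read/write lock around the swap triggered by the config callback.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: each Storm worker holds its own lookup data (built
// from a worker-local file, so JVMs never clobber each other). Within one
// JVM, a read/write lock lets lookups keep serving while the config-update
// callback swaps in freshly loaded data.
public class WorkerLocalGeoDb {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private Map<String, String> db = new HashMap<>();

    // Called from the config-update callback after rebuilding from the new file.
    public void refresh(Map<String, String> freshlyLoaded) {
        lock.writeLock().lock();
        try {
            db = freshlyLoaded;   // atomic swap of the whole snapshot
        } finally {
            lock.writeLock().unlock();
        }
    }

    public String lookup(String ip) {
        lock.readLock().lock();
        try {
            return db.get(ip);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Swapping a whole snapshot rather than mutating in place is what keeps readers consistent mid-refresh; the same shape would apply if the map were a MapDB file reopened under the write lock.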

