metron-dev mailing list archives

From Matt Foley <ma...@apache.org>
Subject Re: [DISCUSS] Moving GeoIP management away from MySQL
Date Mon, 16 Jan 2017 19:27:38 GMT
Hi Justin, and team,
Several components of the Hadoop stack use a SQL database, usually for metadata of some
sort.  Ambari knows this and arranges for them to share a single database installation (on
or off the cluster), unless they are explicitly configured to use different databases (which
is allowed for sites that desire it).  Ambari defaults to PostgreSQL, although it’s happy
to use MySQL, Oracle, or Microsoft, along with whatever each component historically defined
as its default (such as Derby).

If we want to start with a replacement of current functionality, I would suggest switching
the default database to PostgreSQL.  Replacing fast, efficient, and proven db services with
a file-based API library (with no standard way to propagate the underlying storage files)
seems to me to be a step backwards.

Sticking with a SQL-based service will surely minimize the amount of code changes needed.
Making the SQL either dialect-independent or capable of switching among dialects then
enables us to do what the rest of the Hadoop stack does:  allow enterprise customers to
substitute Oracle or Microsoft enterprise-class databases where they wish.  Regarding the
drivers, we should study what the other stack components do; I’m not an expert in those areas.

Using the same db as the rest of the stack also means administrators can be confident they’ve
set up adequate backup and recovery processes.
All these are valuable reasons not to roll our own storage system for this enrichment data.
IMO, of course.

Cheers,
--Matt


On 1/16/17, 9:52 AM, "Kyle Richardson" <kylerichardson2@gmail.com> wrote:

    +1 Agree with David's order
    
    -Kyle
    
    On Mon, Jan 16, 2017 at 12:41 PM, David Lyle <dlyle65535@gmail.com> wrote:
    
    > Def agree on the parity point.
    >
    > I'm a little worried about Supervisor relocations for non-HBase solutions,
    > but having much of the work done for us by MaxMind changes my preference to
    > (in order)
    >
    > 1) MM API
    > 2) HBase Enrichment
    > 3) MapDB should the others prove not feasible
    >
    >
    > -D...
    >
    >
    > On Mon, Jan 16, 2017 at 12:15 PM, Justin Leet <justinjleet@gmail.com>
    > wrote:
    >
    > > I definitely agree on checking out the MaxMind API.  I'll take a look at
    > > it, but at first glance it looks like it does include everything we use.
    > > Great find, JJ.
    > >
    > > More details on various people's points:
    > >
    > > As a note to anyone hopping in, Simon's point on the range lookup vs a key
    > > lookup is why it becomes a Scan in HBase vs a Get.  As an addendum to what
    > > Simon mentioned, denormalizing is easy enough and turns it into an easy
    > > range lookup.
    > >
    > > To David's point, the MapDB does require a network hop, but it's once per
    > > refresh of the data (Got a relevant callback? Grab new data, load it, swap
    > > out) instead of (up to) once per message.  I would expect the same to be
    > > true of the MaxMind db files.
    > >
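
The "grab new data, load it, swap out" step can be sketched as an atomic reference to an
immutable snapshot. This is only an illustration under stated assumptions: the class and
method names are hypothetical, a plain in-memory map stands in for the MapDB or MaxMind
file, and nothing here is taken from Metron's actual callback code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: readers always see a complete snapshot; a refresh swaps it in one step.
class SwappableLookup {
    private final AtomicReference<Map<String, String>> current =
        new AtomicReference<Map<String, String>>(Collections.<String, String>emptyMap());

    // Would be called from the (assumed) config-update callback once new data is loaded.
    // The data is copied so later changes to the caller's map cannot affect readers.
    void refresh(Map<String, String> freshlyLoaded) {
        current.set(Collections.unmodifiableMap(new HashMap<String, String>(freshlyLoaded)));
    }

    // Called once per message; purely local, so no network hop on the lookup path.
    String get(String key) {
        return current.get().get(key);
    }

    public static void main(String[] args) {
        SwappableLookup lookup = new SwappableLookup();
        Map<String, String> data = new HashMap<String, String>();
        data.put("192.0.2.1", "Site A");
        lookup.refresh(data);
        System.out.println(lookup.get("192.0.2.1"));
    }
}
```

The network cost is paid in whatever loads the data before `refresh` is called, which
happens once per update rather than once per message.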
    > > I'd also argue MapDB is not really more complex than refreshing the HBase
    > > table, because we potentially have to start worrying about things like
    > > hashing and/or indices and even just general data representation. It's
    > > definitely correct that the file processing has to occur on either path, so
    > > it really boils down to handling the callback and reloading the file vs
    > > handling some of the standard HBasey things.  I don't think either is an
    > > enormous amount of work (and both are almost certainly more work than
    > > MaxMind's API)
    > >
    > > Regarding extensibility, I'd argue for parity with what we have first, then
    > > build what we need from there.  Does anybody have any disagreement with
    > > that approach for right now?
    > >
    > > Justin
    > >
    > > On Mon, Jan 16, 2017 at 12:04 PM, David Lyle <dlyle65535@gmail.com>
    > > wrote:
    > >
    > > > It is interesting: it would save us a ton of effort, and it has the right
    > > > license. I think it's worth at least checking out.
    > > >
    > > > -D...
    > > >
    > > >
    > > > On Mon, Jan 16, 2017 at 12:00 PM, Simon Elliston Ball <
    > > > simon@simonellistonball.com> wrote:
    > > >
    > > > > I like that approach even more. That way we would only have to worry
    > > > > about distributing the database file in binary format to all the
    > > > > supervisor nodes on update.
    > > > >
    > > > > It would also make it easier for people to switch to the enterprise DB
    > > > > potentially if they had the license.
    > > > >
    > > > > One slight issue with this might be for people who wanted to extend the
    > > > > database. For example, organisations may want to add geo-enrichment to
    > > > > their own private network addresses based on modified versions of the
    > > > > geo database. Currently we don’t really allow this, since we hard-code
    > > > > ignoring private network classes into the geo enrichment adapter, but I
    > > > > can see a case where a global org might want to add their own ranges
    > > > > and locations to the data set. Does that make sense to anyone else?
    > > > >
    > > > > Simon
    > > > >
    > > > >
    > > > > > On 16 Jan 2017, at 16:50, JJ Meyer <jjmeyer0@gmail.com> wrote:
    > > > > >
    > > > > > Hello all,
    > > > > >
    > > > > > Can we leverage MaxMind's Java client (
    > > > > > https://github.com/maxmind/GeoIP2-java/tree/master/src/main/java/com/maxmind/geoip2)
    > > > > > in this case? I believe it can directly read the MaxMind file. Plus, I
    > > > > > think it also has some support for caching.
    > > > > >
    > > > > > Thanks,
    > > > > > JJ
    > > > > >
    > > > > > On Mon, Jan 16, 2017 at 10:32 AM, Simon Elliston Ball <
    > > > > > simon@simonellistonball.com> wrote:
    > > > > >
    > > > > >> I like the idea of MapDB, since we can essentially pull an instance
    > > > > >> into each supervisor, so it makes a lot of sense for relatively small
    > > > > >> scale, relatively static enrichments in general.
    > > > > >>
    > > > > >> Generally this feels like a caching problem, and would be for a simple
    > > > > >> key-value lookup. In that case I would agree with David Lyle on using
    > > > > >> HBase as a source of truth and relying on caching.
    > > > > >>
    > > > > >> That said, GeoIP is a different lookup pattern, since it’s a range
    > > > > >> lookup then a key lookup (or if we denormalize the MaxMind data, just a
    > > > > >> range lookup). For that kind of thing MapDB with something like the
    > > > > >> BTree seems a good fit.
    > > > > >>
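
The denormalized range lookup described above can be sketched with a sorted map. This is a
minimal illustration that uses `java.util.TreeMap` as a stand-in for a MapDB BTree; the
class name, the sample ranges (documentation address blocks), and the location labels are
made up for the example, and it assumes the ranges do not overlap.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: denormalized IPv4 ranges in a sorted map, keyed by range start.
class GeoRangeLookup {
    // Holds the inclusive end of a range together with its (denormalized) location.
    static final class Range {
        final long end;
        final String location;
        Range(long end, String location) { this.end = end; this.location = location; }
    }

    private final TreeMap<Long, Range> ranges = new TreeMap<Long, Range>();

    // Converts dotted-quad IPv4 text to an unsigned 32-bit value held in a long.
    static long toLong(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) throw new IllegalArgumentException("Not IPv4: " + ip);
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) throw new IllegalArgumentException("Not IPv4: " + ip);
            value = (value << 8) | octet;
        }
        return value;
    }

    void add(String start, String end, String location) {
        ranges.put(toLong(start), new Range(toLong(end), location));
    }

    // Finds the greatest range start <= ip, then checks that ip falls inside that range.
    String lookup(String ip) {
        long key = toLong(ip);
        Map.Entry<Long, Range> entry = ranges.floorEntry(key);
        if (entry == null || key > entry.getValue().end) return null;
        return entry.getValue().location;
    }

    public static void main(String[] args) {
        GeoRangeLookup geo = new GeoRangeLookup();
        geo.add("192.0.2.0", "192.0.2.255", "Site A");
        System.out.println(geo.lookup("192.0.2.17"));
    }
}
```

A single `floorEntry` call replaces the range-then-key pair of lookups because the location
is stored alongside the range end.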
    > > > > >> Simon
    > > > > >>
    > > > > >>
    > > > > >>> On 16 Jan 2017, at 16:28, David Lyle <dlyle65535@gmail.com> wrote:
    > > > > >>>
    > > > > >>> I'm +1 on removing the MySQL dependency, BUT - I'd prefer to see it
    > > > > >>> as an HBase enrichment. If our current caching isn't enough to
    > > > > >>> mitigate the above issues, we have a problem, don't we? Or do we not
    > > > > >>> recommend HBase enrichment for per message enrichment in general?
    > > > > >>>
    > > > > >>> Also- can you elaborate on how MapDB would not require a network hop?
    > > > > >>> Doesn't this mean we would have to sync the enrichment data to each
    > > > > >>> Storm supervisor? HDFS could (probably would) have a network hop too,
    > > > > >>> no?
    > > > > >>>
    > > > > >>> Fwiw -
    > > > > >>> "In its place, I've looked at using MapDB, which is a really easy to
    > > > > >>> use library for creating Java collections backed by a file (This is
    > > > > >>> NOT a separate installation of anything, it's just a jar that manages
    > > > > >>> interaction with the file system).  Given the slow churn of the GeoIP
    > > > > >>> files (I believe they get updated once a week), we can have a script
    > > > > >>> that can be run when needed, downloads the MaxMind tar file, builds
    > > > > >>> the MapDB file that will be used by the bolts, and places it into
    > > > > >>> HDFS.  Finally, we update a config to point to the new file, the
    > > > > >>> bolts get the updated config callback and can update their db files.
    > > > > >>> Inside the code, we wrap the MapDB portions to make it transparent to
    > > > > >>> downstream code."
    > > > > >>>
    > > > > >>> Seems a bit more complex than "refresh the hbase table". Afaik,
    > > > > >>> either approach would require some sort of translation between GeoIP
    > > > > >>> source format and target format, so that part is a wash imo.
    > > > > >>>
    > > > > >>> So, I'd really like to see, at least, an attempt to leverage HBase
    > > > > >>> enrichment.
    > > > > >>>
    > > > > >>> -D...
    > > > > >>>
    > > > > >>>
    > > > > >>> On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <cestella@gmail.com>
    > > > > >>> wrote:
    > > > > >>>
    > > > > >>>> I think that it's a sensible thing to use MapDB for the geo
    > > > > >>>> enrichment. Let me state my reasoning:
    > > > > >>>>
    > > > > >>>>  - An HBase implementation would necessitate an HBase scan possibly
    > > > > >>>>  hitting HDFS, which is expensive per-message.
    > > > > >>>>  - An HBase implementation would necessitate a network hop and MapDB
    > > > > >>>>  would not.
    > > > > >>>>
    > > > > >>>> I also think this might be the beginning of a more general purpose
    > > > > >>>> support in Stellar for locally shipped, read-only MapDB lookups,
    > > > > >>>> which might be interesting.
    > > > > >>>>
    > > > > >>>> In short, all quotes about premature optimization are sure to apply
    > > > > >>>> to my reasoning, but I can't help but have my spidey senses tingle
    > > > > >>>> when we introduce a scan-per-message architecture.
    > > > > >>>>
    > > > > >>>> Casey
    > > > > >>>>
    > > > > >>>> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <
    > > > > >>>> Dima.Kovalyov@sstech.us> wrote:
    > > > > >>>>
    > > > > >>>>> Hello Justin,
    > > > > >>>>>
    > > > > >>>>> Considering that Metron uses HBase tables for storing enrichment
    > > > > >>>>> and threatintel feeds, can we use HBase for geo enrichment as well?
    > > > > >>>>> Or can MapDB be used for enrichment and threatintel feeds instead
    > > > > >>>>> of HBase?
    > > > > >>>>>
    > > > > >>>>> - Dima
    > > > > >>>>>
    > > > > >>>>> On 01/16/2017 04:17 PM, Justin Leet wrote:
    > > > > >>>>>> Hi all,
    > > > > >>>>>>
    > > > > >>>>>> As a bit of background, right now, GeoIP data is loaded into and
    > > > > >>>>>> managed by MySQL (the connectors are LGPL licensed and we need to
    > > > > >>>>>> sever our Maven dependency on it before next release). We
    > > > > >>>>>> currently depend on and install an instance of MySQL (in each of
    > > > > >>>>>> the Management Pack, Ansible, and Docker installs). In the
    > > > > >>>>>> topology, we use the JDBCAdapter to connect to MySQL and query for
    > > > > >>>>>> a given IP.  Additionally, it's a single point of failure for that
    > > > > >>>>>> particular enrichment right now.  If MySQL is down, geo enrichment
    > > > > >>>>>> can't occur.
    > > > > >>>>>>
    > > > > >>>>>> I'm proposing that we eliminate the use of MySQL entirely, through
    > > > > >>>>>> all installation paths (which, unless I missed some, includes
    > > > > >>>>>> Ansible, the Ambari Management Pack, and Docker).  We'd do this by
    > > > > >>>>>> dropping all the various MySQL setup and management through the
    > > > > >>>>>> code, along with all the DDL, etc.  The JDBCAdapter would stay, so
    > > > > >>>>>> that anybody who wants to set up their own databases for
    > > > > >>>>>> enrichments and install connectors is able to do so.
    > > > > >>>>>>
    > > > > >>>>>> In its place, I've looked at using MapDB, which is a really easy
    > > > > >>>>>> to use library for creating Java collections backed by a file
    > > > > >>>>>> (This is NOT a separate installation of anything, it's just a jar
    > > > > >>>>>> that manages interaction with the file system).  Given the slow
    > > > > >>>>>> churn of the GeoIP files (I believe they get updated once a week),
    > > > > >>>>>> we can have a script that can be run when needed, downloads the
    > > > > >>>>>> MaxMind tar file, builds the MapDB file that will be used by the
    > > > > >>>>>> bolts, and places it into HDFS.  Finally, we update a config to
    > > > > >>>>>> point to the new file, the bolts get the updated config callback
    > > > > >>>>>> and can update their db files.  Inside the code, we wrap the MapDB
    > > > > >>>>>> portions to make it transparent to downstream code.
    > > > > >>>>>>
    > > > > >>>>>> The particularly nice parts about using MapDB are its ease of use,
    > > > > >>>>>> plus it offers the utilities we need out of the box to support the
    > > > > >>>>>> operations we need on this (keep in mind the GeoIP files use IP
    > > > > >>>>>> ranges and we need to be able to easily grab the appropriate
    > > > > >>>>>> range).
    > > > > >>>>>>
    > > > > >>>>>> The main point of concern I have about this is that when we grab
    > > > > >>>>>> the HDFS file during an update, given that multiple JVMs can be
    > > > > >>>>>> running, we don't want them to clobber each other. I believe this
    > > > > >>>>>> can be avoided by simply using each worker's working directory to
    > > > > >>>>>> store the file (and appropriately ensure threads on the same JVM
    > > > > >>>>>> manage multithreading).  This should keep the JVMs (and the
    > > > > >>>>>> underlying DB files) entirely independent.
    > > > > >>>>>>
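
One way to picture the per-worker copy is the sketch below. It is a rough illustration
under stated assumptions: a local path stands in for the HDFS source, the class and method
names are hypothetical, and the temp-file-then-rename step is an added precaution of this
example rather than something the proposal specifies.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: each worker keeps its own copy of the db file in its own directory.
class WorkerLocalCopy {
    // Copies source into workerDir under the given name. The data is written to a temp
    // file first and then renamed, so a reader does not observe a half-written target.
    static Path install(Path source, Path workerDir, String name) throws IOException {
        Files.createDirectories(workerDir);
        Path tmp = Files.createTempFile(workerDir, name, ".tmp");
        try {
            Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
            // With ATOMIC_MOVE, replacing an existing target is implementation specific;
            // on POSIX file systems the rename replaces it.
            return Files.move(tmp, workerDir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // No-op after a successful move; cleans up the temp file if something failed.
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("geo-demo");
        Path source = Files.write(base.resolve("source.db"), "v1".getBytes(StandardCharsets.UTF_8));
        Path copy = install(source, base.resolve("worker-1"), "geo.db");
        System.out.println("Installed " + copy);
    }
}
```

Because every worker writes only inside its own directory, two JVMs updating at the same
time never touch the same file.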
    > > > > >>>>>> This script would get called by the various installations during
    > > > > >>>>>> startup to do the initial setup.  After install, it can then be
    > > > > >>>>>> called on demand in order.
    > > > > >>>>>>
    > > > > >>>>>> At this point, we should be all set, with everything running and
    > > > > >>>>>> updatable.
    > > > > >>>>>>
    > > > > >>>>>> Justin
    > > > > >>>>>>
    > > > > >>>>>
    > > > > >>>>>
    > > > > >>>>
    > > > > >>
    > > > > >>
    > > > >
    > > > >
    > > >
    > >
    >
    



