metron-dev mailing list archives

From Justin Leet <justinjl...@gmail.com>
Subject Re: [DISCUSS] Moving GeoIP management away from MySQL
Date Mon, 16 Jan 2017 21:10:57 GMT
For MapDB, the short version is that it can do what we need without issue.
I've been able to load the data up and get back correct results with it,
although using MM's data is definitely easier and better.  I'm very happy
with dropping MapDB from the conversation in favor of the MM data, assuming
nobody strongly supports it.  The access pattern was the reason for MapDB
in the first place, and MM's built solution handles it more gracefully.
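
For anyone newer to the thread, the access pattern in question is a range lookup: find the range whose start is at or below the address, then check that the address does not run past that range's end. Below is a rough sketch of that pattern using a plain in-memory java.util.TreeMap as a stand-in. It is not MapDB or MaxMind code; the class name, the sample ranges, and the location labels are made up for illustration, and it assumes non-overlapping IPv4 ranges.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; not Metron, MapDB, or MaxMind code. */
class IpRangeLookupSketch {

    /** One denormalized row: inclusive end of the range plus a location label. */
    static final class Range {
        final long end;
        final String location;

        Range(long end, String location) {
            this.end = end;
            this.location = location;
        }
    }

    // Ranges keyed by their starting address; assumed not to overlap.
    private final TreeMap<Long, Range> rangesByStart = new TreeMap<>();

    /** Converts dotted-quad IPv4 text to a number; checks only that there are four parts. */
    static long toLong(String ipv4) {
        String[] octets = ipv4.split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ipv4);
        }
        long value = 0;
        for (String octet : octets) {
            value = (value << 8) | Integer.parseInt(octet);
        }
        return value;
    }

    void addRange(String startIp, String endIp, String location) {
        rangesByStart.put(toLong(startIp), new Range(toLong(endIp), location));
    }

    /** Returns the label of the range containing ip, or null if no range contains it. */
    String lookup(String ip) {
        long address = toLong(ip);
        // Greatest range start that is <= address: the "range scan" part.
        Map.Entry<Long, Range> candidate = rangesByStart.floorEntry(address);
        if (candidate == null || address > candidate.getValue().end) {
            return null;
        }
        return candidate.getValue().location;
    }
}
```

A sorted map with a floor lookup is roughly what a BTree-backed collection provides, which is why a pure key-value Get does not cover this case on its own.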

For the MM data, +1 to JJ's notes.  For more context, the client can load
up the binary version of their free data (or presumably their paid version
also).  I've already played around with this and it definitely works (and
provides more data than we currently pass on from the old API).  It does
not load into an RDBMS; it's pretty much the same concept as MapDB here,
but with their binary format.  See:
http://dev.maxmind.com/geoip/geoip2/geolite2/
For the data we use:
http://geolite.maxmind.com/download/geoip/database/GeoLite2-City.mmdb.gz

It has the benefit of both being code that MaxMind has written and
avoiding network hops, assuming we distribute it in a way everyone is happy
with. I don't see HDFS as an unreasonable place for the file to live, given
that it's updated once a week.

Alternatively (and I'm not a Storm expert), it might also be possible to
throw the file in the DistributedCache.  The main concern I have there is
that I don't know that it provides callbacks on update, so it might require
using tick tuples to check for a new file every X interval.  It's definitely
something I'd be hesitant to use without more info on that. If someone can
speak to the viability of it, it shouldn't be very hard to do.
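
A minimal sketch of what interval-driven reloading could look like, assuming the bolt can learn the file's current path and modification time on each tick (how it learns them is not shown and would depend on where the file lives). ReloadableDatabase and its loader function are hypothetical names, not an existing Storm or Metron API; T stands in for whatever reader object wraps the database file.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

/** Hypothetical holder that swaps in a freshly loaded database when the file changes. */
class ReloadableDatabase<T> {
    private final Function<String, T> loader;            // assumed: builds a reader from a path
    private final AtomicReference<T> current = new AtomicReference<>();
    private String loadedPath;                           // guarded by synchronized methods
    private long loadedModified;

    ReloadableDatabase(Function<String, T> loader, String path, long lastModified) {
        this.loader = loader;
        refreshIfChanged(path, lastModified);
    }

    /**
     * Intended to be called on every tick (or from a config callback).
     * Reloads only when the path or modification time differs from what is loaded.
     */
    synchronized boolean refreshIfChanged(String path, long lastModified) {
        if (path.equals(loadedPath) && lastModified == loadedModified) {
            return false;
        }
        T fresh = loader.apply(path);   // load fully first, so readers never see a partial state
        current.set(fresh);             // then swap the reference in one step
        loadedPath = path;
        loadedModified = lastModified;
        return true;
    }

    /** Readers always get the most recently swapped-in database. */
    T get() {
        return current.get();
    }
}
```

The cost of polling this way is one cheap comparison per tick; the load itself happens only when something actually changed.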

The main problems I have with SQL databases in this particular case are:
1) They require a network hop for every message that isn't cached.
2) They represent a single point of failure in the enrichment process.

Justin

On Mon, Jan 16, 2017 at 3:59 PM, JJ Meyer <jjmeyer0@gmail.com> wrote:

> Matt, I agree with your points on why we shouldn't just get rid of the
> database just to get rid of a database. But IMO, I think we may be
> reinventing the wheel a little bit by even putting the maxmind data into
> MySQL. Right now we are already downloading a maxmind file. To me it seems
> simpler to push the file to HDFS where we can pick it up and have the
> maxmind client use that instead of importing data into a DB and then
> running a query. Also, I believe the data gets updated weekly. So syncing
> may become easier too.
>
> James, I believe it works with the paid and free versions of geoip. I know
> NiFi uses this client library in their Geo enrichment processor.
>
> Also, if it is decided that using a SQL database is still the best
> solution, I think there is a benefit to using their library. We would just
> have to implement a `DatabaseProvider` that hits some SQL db instead of
> using their standard implementation.
>
> Thanks,
> JJ
>
> On Mon, Jan 16, 2017 at 2:27 PM, James Sirota <jsirota@apache.org> wrote:
>
> > Hi Guys, I just wanted to clarify one point that I think is lost in this
> > thread.  Geo enrichment is NOT a key-value enrichment.  It requires a
> > range scan and a join (which is why it's implemented via MySQL and not
> > HBase).  To account for this access pattern via a key-value store you
> > would inevitably have to do something funky or, in the case of HBase, I
> > don't think there is a way to avoid doing a range scan.
> >
> > With respect to MapDB, it only has support for Maps, Sets, Lists, Queues.
> > Are we sure it provides enough functionality for us to do this
> > enrichment?
> >
> > With respect to the MaxMind client, are we sure we can use it on the
> > MySQL-backed version of their DB?  I thought the MaxMind database itself
> > is proprietary and is something you have to pay for.  My understanding is
> > that the client is designed for that proprietary version.
> >
> > I somewhat agree with Matt's point.  If MySQL is a problem because of
> > licensing, the path of least resistance to remove MySQL dependencies
> > would be to simply switch to PostgreSQL.  We will always have
> > conventional SQL databases in our stack because other big data tools use
> > them. Why not take advantage of them too?
> >
> > Thanks,
> > James
> >
> > 16.01.2017, 12:27, "Matt Foley" <mattf@apache.org>:
> > > Hi Justin, and team,
> > > Several components of the Hadoop Stack utilize a SQL database, usually
> > > for metadata of some sort. Ambari knows this and arranges for them to
> > > share a single database installation (on or off the cluster), unless
> > > they explicitly configure use of different databases (which is allowed
> > > for sites that desire it). Ambari defaults to using PostgreSQL,
> > > although it’s happy to use MySQL, Oracle, or Microsoft, along with
> > > whatever each component historically defined as their default (such as
> > > Derby).
> > >
> > > If we want to start with a replacement of current functionality, I
> > > would suggest switching the default database to PostgreSQL. Replacing
> > > fast, efficient, and proven db services with a file-based api library
> > > (but no standard way to propagate the underlying storage files) seems
> > > to me to be taking a step backwards.
> > >
> > > Sticking with a SQL-based service will surely minimize the amount of
> > > code changes needed. And making the SQL either dialect-independent or
> > > capable of switching among dialects then enables us to do what the rest
> > > of the Hadoop stack does: allow enterprise customers to substitute
> > > Oracle or Microsoft enterprise-class databases where they wish.
> > > Regarding the drivers, we should study what the other Stack components
> > > do; I’m not an expert in those areas.
> > >
> > > Using the same db as the rest of the stack also means administrators
> > > can be confident they’ve set up adequate backup and recovery processes.
> > > All these are valuable reasons not to roll our own storage system for
> > > this enrichment data. IMO, of course.
> > >
> > > Cheers,
> > > --Matt
> > >
> > > On 1/16/17, 9:52 AM, "Kyle Richardson" <kylerichardson2@gmail.com> wrote:
> > >
> > >     +1 Agree with David's order
> > >
> > >     -Kyle
> > >
> > >     On Mon, Jan 16, 2017 at 12:41 PM, David Lyle <dlyle65535@gmail.com> wrote:
> > >
> > >     > Def agree on the parity point.
> > >     >
> > >     > I'm a little worried about Supervisor relocations for non-HBase
> > >     > solutions, but having much of the work done for us by MaxMind
> > >     > changes my preference to (in order)
> > >     >
> > >     > 1) MM API
> > >     > 2) HBase Enrichment
> > >     > 3) MapDB should the others prove not feasible
> > >     >
> > >     >
> > >     > -D...
> > >     >
> > >     >
> > >     > On Mon, Jan 16, 2017 at 12:15 PM, Justin Leet <justinjleet@gmail.com> wrote:
> > >     >
> > >     > > I definitely agree on checking out the MaxMind API. I'll take a
> > >     > > look at it, but at first glance it looks like it does include
> > >     > > everything we use. Great find, JJ.
> > >     > >
> > >     > > More details on various people's points:
> > >     > >
> > >     > > As a note to anyone hopping in, Simon's point on the range
> > >     > > lookup vs a key lookup is why it becomes a Scan in HBase vs a
> > >     > > Get. As an addendum to what Simon mentioned, denormalizing is
> > >     > > easy enough and turns it into an easy range lookup.
> > >     > >
> > >     > > To David's point, the MapDB does require a network hop, but it's
> > >     > > once per refresh of the data (Got a relevant callback? Grab new
> > >     > > data, load it, swap out) instead of (up to) once per message. I
> > >     > > would expect the same to be true of the MaxMind db files.
> > >     > >
> > >     > > I'd also argue MapDB is not really more complex than refreshing
> > >     > > the HBase table, because we potentially have to start worrying
> > >     > > about things like hashing and/or indices and even just general
> > >     > > data representation. It's definitely correct that the file
> > >     > > processing has to occur on either path, so it really boils down
> > >     > > to handling the callback and reloading the file vs handling some
> > >     > > of the standard HBasey things. I don't think either is an
> > >     > > enormous amount of work (and both are almost certainly more work
> > >     > > than MaxMind's API).
> > >     > >
> > >     > > Regarding extensibility, I'd argue for parity with what we have
> > >     > > first, then build what we need from there. Does anybody have any
> > >     > > disagreement with that approach for right now?
> > >     > >
> > >     > > Justin
> > >     > >
> > >     > > On Mon, Jan 16, 2017 at 12:04 PM, David Lyle <dlyle65535@gmail.com> wrote:
> > >     > >
> > >     > > > It is interesting - it saves us a ton of effort, and has the
> > >     > > > right license. I think it's worth at least checking out.
> > >     > > >
> > >     > > > -D...
> > >     > > >
> > >     > > >
> > >     > > > On Mon, Jan 16, 2017 at 12:00 PM, Simon Elliston Ball <simon@simonellistonball.com> wrote:
> > >     > > >
> > >     > > > > I like that approach even more. That way we would only have
> > >     > > > > to worry about distributing the database file in binary
> > >     > > > > format to all the supervisor nodes on update.
> > >     > > > >
> > >     > > > > It would also make it easier for people to switch to the
> > >     > > > > enterprise DB potentially if they had the license.
> > >     > > > >
> > >     > > > > One slight issue with this might be for people who wanted to
> > >     > > > > extend the database. For example, organisations may want to
> > >     > > > > add geo-enrichment to their own private network addresses
> > >     > > > > based on modified versions of the geo database. Currently we
> > >     > > > > don’t really allow this, since we hard-code ignoring private
> > >     > > > > network classes into the geo enrichment adapter, but I can
> > >     > > > > see a case where a global org might want to add their own
> > >     > > > > ranges and locations to the data set. Does that make sense
> > >     > > > > to anyone else?
> > >     > > > >
> > >     > > > > Simon
> > >     > > > >
> > >     > > > >
> > >     > > > > > On 16 Jan 2017, at 16:50, JJ Meyer <jjmeyer0@gmail.com> wrote:
> > >     > > > > >
> > >     > > > > > Hello all,
> > >     > > > > >
> > >     > > > > > Can we leverage maxmind's Java client
> > >     > > > > > (https://github.com/maxmind/GeoIP2-java/tree/master/src/main/java/com/maxmind/geoip2)
> > >     > > > > > in this case? I believe it can directly read maxmind file.
> > >     > > > > > Plus I think it also has some support for caching as well.
> > >     > > > > >
> > >     > > > > > Thanks,
> > >     > > > > > JJ
> > >     > > > > >
> > >     > > > > > On Mon, Jan 16, 2017 at 10:32 AM, Simon Elliston Ball <simon@simonellistonball.com> wrote:
> > >     > > > > >
> > >     > > > > >> I like the idea of MapDB, since we can essentially pull an
> > >     > > > > >> instance into each supervisor, so it makes a lot of sense
> > >     > > > > >> for relatively small scale, relatively static enrichments
> > >     > > > > >> in general.
> > >     > > > > >>
> > >     > > > > >> Generally this feels like a caching problem, and would be
> > >     > > > > >> for a simple key-value lookup. In that case I would agree
> > >     > > > > >> with David Lyle on using HBase as a source of truth and
> > >     > > > > >> relying on caching.
> > >     > > > > >>
> > >     > > > > >> That said, GeoIP is a different lookup pattern, since it’s
> > >     > > > > >> a range lookup then a key lookup (or if we denormalize the
> > >     > > > > >> MaxMind data, just a range lookup). For that kind of thing
> > >     > > > > >> MapDB with something like the BTree seems a good fit.
> > >     > > > > >>
> > >     > > > > >> Simon
> > >     > > > > >>
> > >     > > > > >>
> > >     > > > > >>> On 16 Jan 2017, at 16:28, David Lyle <dlyle65535@gmail.com> wrote:
> > >     > > > > >>>
> > >     > > > > >>> I'm +1 on removing the MySQL dependency, BUT - I'd prefer
> > >     > > > > >>> to see it as an HBase enrichment. If our current caching
> > >     > > > > >>> isn't enough to mitigate the above issues, we have a
> > >     > > > > >>> problem, don't we? Or do we not recommend HBase enrichment
> > >     > > > > >>> for per message enrichment in general?
> > >     > > > > >>>
> > >     > > > > >>> Also- can you elaborate on how MapDB would not require a
> > >     > > > > >>> network hop? Doesn't this mean we would have to sync the
> > >     > > > > >>> enrichment data to each Storm supervisor? HDFS could
> > >     > > > > >>> (probably would) have a network hop too, no?
> > >     > > > > >>>
> > >     > > > > >>> Fwiw -
> > >     > > > > >>> "In its place, I've looked at using MapDB, which is a
> > >     > > > > >>> really easy to use library for creating Java collections
> > >     > > > > >>> backed by a file (This is NOT a separate installation of
> > >     > > > > >>> anything, it's just a jar that manages interaction with
> > >     > > > > >>> the file system). Given the slow churn of the GeoIP files
> > >     > > > > >>> (I believe they get updated once a week), we can have a
> > >     > > > > >>> script that can be run when needed, downloads the MaxMind
> > >     > > > > >>> tar file, builds the MapDB file that will be used by the
> > >     > > > > >>> bolts, and places it into HDFS. Finally, we update a
> > >     > > > > >>> config to point to the new file, the bolts get the updated
> > >     > > > > >>> config callback and can update their db files. Inside the
> > >     > > > > >>> code, we wrap the MapDB portions to make it transparent to
> > >     > > > > >>> downstream code."
> > >     > > > > >>>
> > >     > > > > >>> Seems a bit more complex than "refresh the hbase table".
> > >     > > > > >>> Afaik, either approach would require some sort of
> > >     > > > > >>> translation between GeoIP source format and target format,
> > >     > > > > >>> so that part is a wash imo.
> > >     > > > > >>>
> > >     > > > > >>> So, I'd really like to see, at least, an attempt to
> > >     > > > > >>> leverage HBase enrichment.
> > >     > > > > >>>
> > >     > > > > >>> -D...
> > >     > > > > >>>
> > >     > > > > >>>
> > >     > > > > >>> On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <cestella@gmail.com> wrote:
> > >     > > > > >>>
> > >     > > > > >>>> I think that it's a sensible thing to use MapDB for the
> > >     > > > > >>>> geo enrichment. Let me state my reasoning:
> > >     > > > > >>>>
> > >     > > > > >>>> - An HBase implementation would necessitate a HBase scan
> > >     > > > > >>>>   possibly hitting HDFS, which is expensive per-message.
> > >     > > > > >>>> - An HBase implementation would necessitate a network hop
> > >     > > > > >>>>   and MapDB would not.
> > >     > > > > >>>>
> > >     > > > > >>>> I also think this might be the beginning of a more
> > >     > > > > >>>> general purpose support in Stellar for locally shipped,
> > >     > > > > >>>> read-only MapDB lookups, which might be interesting.
> > >     > > > > >>>>
> > >     > > > > >>>> In short, all quotes about premature optimization are
> > >     > > > > >>>> sure to apply to my reasoning, but I can't help but have
> > >     > > > > >>>> my spidey senses tingle when we introduce a
> > >     > > > > >>>> scan-per-message architecture.
> > >     > > > > >>>>
> > >     > > > > >>>> Casey
> > >     > > > > >>>>
> > >     > > > > >>>> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <Dima.Kovalyov@sstech.us> wrote:
> > >     > > > > >>>>
> > >     > > > > >>>>> Hello Justin,
> > >     > > > > >>>>>
> > >     > > > > >>>>> Considering that Metron uses HBase tables for storing
> > >     > > > > >>>>> enrichment and threatintel feeds, can we use HBase for
> > >     > > > > >>>>> geo enrichment as well? Or can MapDB be used for
> > >     > > > > >>>>> enrichment and threatintel feeds instead of HBase?
> > >     > > > > >>>>>
> > >     > > > > >>>>> - Dima
> > >     > > > > >>>>>
> > >     > > > > >>>>> On 01/16/2017 04:17 PM, Justin Leet wrote:
> > >     > > > > >>>>>> Hi all,
> > >     > > > > >>>>>>
> > >     > > > > >>>>>> As a bit of background, right now, GeoIP data is loaded
> > >     > > > > >>>>>> into and managed by MySQL (the connectors are LGPL
> > >     > > > > >>>>>> licensed and we need to sever our Maven dependency on it
> > >     > > > > >>>>>> before next release). We currently depend on and install
> > >     > > > > >>>>>> an instance of MySQL (in each of the Management Pack,
> > >     > > > > >>>>>> Ansible, and Docker installs). In the topology, we use
> > >     > > > > >>>>>> the JDBCAdapter to connect to MySQL and query for a
> > >     > > > > >>>>>> given IP. Additionally, it's a single point of failure
> > >     > > > > >>>>>> for that particular enrichment right now. If MySQL is
> > >     > > > > >>>>>> down, geo enrichment can't occur.
> > >     > > > > >>>>>>
> > >     > > > > >>>>>> I'm proposing that we eliminate the use of MySQL
> > >     > > > > >>>>>> entirely, through all installation paths (which, unless
> > >     > > > > >>>>>> I missed some, includes Ansible, the Ambari Management
> > >     > > > > >>>>>> Pack, and Docker). We'd do this by dropping all the
> > >     > > > > >>>>>> various MySQL setup and management through the code,
> > >     > > > > >>>>>> along with all the DDL, etc. The JDBCAdapter would stay,
> > >     > > > > >>>>>> so that anybody who wants to set up their own databases
> > >     > > > > >>>>>> for enrichments and install connectors is able to do so.
> > >     > > > > >>>>>>
> > >     > > > > >>>>>> In its place, I've looked at using MapDB, which is a
> > >     > > > > >>>>>> really easy to use library for creating Java collections
> > >     > > > > >>>>>> backed by a file (This is NOT a separate installation of
> > >     > > > > >>>>>> anything, it's just a jar that manages interaction with
> > >     > > > > >>>>>> the file system). Given the slow churn of the GeoIP
> > >     > > > > >>>>>> files (I believe they get updated once a week), we can
> > >     > > > > >>>>>> have a script that can be run when needed, downloads the
> > >     > > > > >>>>>> MaxMind tar file, builds the MapDB file that will be
> > >     > > > > >>>>>> used by the bolts, and places it into HDFS. Finally, we
> > >     > > > > >>>>>> update a config to point to the new file, the bolts get
> > >     > > > > >>>>>> the updated config callback and can update their db
> > >     > > > > >>>>>> files. Inside the code, we wrap the MapDB portions to
> > >     > > > > >>>>>> make it transparent to downstream code.
> > >     > > > > >>>>>>
> > >     > > > > >>>>>> The particularly nice parts about using MapDB are its
> > >     > > > > >>>>>> ease of use, plus it offers the utilities we need out of
> > >     > > > > >>>>>> the box to be able to support the operations we need on
> > >     > > > > >>>>>> this (Keep in mind the GeoIP files use IP ranges and we
> > >     > > > > >>>>>> need to be able to easily grab the appropriate range).
> > >     > > > > >>>>>>
> > >     > > > > >>>>>> The main point of concern I have about this is that when
> > >     > > > > >>>>>> we grab the HDFS file during an update, given that
> > >     > > > > >>>>>> multiple JVMs can be running, we don't want them to
> > >     > > > > >>>>>> clobber each other. I believe this can be avoided by
> > >     > > > > >>>>>> simply using each worker's working directory to store
> > >     > > > > >>>>>> the file (and appropriately ensure threads on the same
> > >     > > > > >>>>>> JVM manage multithreading). This should keep the JVMs
> > >     > > > > >>>>>> (and the underlying DB files) entirely independent.
> > >     > > > > >>>>>>
> > >     > > > > >>>>>> This script would get called by the various
> > >     > > > > >>>>>> installations during startup to do the initial setup.
> > >     > > > > >>>>>> After install, it can then be called on demand in order.
> > >     > > > > >>>>>>
> > >     > > > > >>>>>> At this point, we should be all set, with everything
> > >     > > > > >>>>>> running and updatable.
> > >     > > > > >>>>>>
> > >     > > > > >>>>>> Justin
> > >     > > > > >>>>>>
> > >     > > > > >>>>>
> > >     > > > > >>>>>
> > >     > > > > >>>>
> > >     > > > > >>
> > >     > > > > >>
> > >     > > > >
> > >     > > > >
> > >     > > >
> > >     > >
> > >     >
> >
> > -------------------
> > Thank you,
> >
> > James Sirota
> > PPMC- Apache Metron (Incubating)
> > jsirota AT apache DOT org
> >
>
