metron-dev mailing list archives

From JJ Meyer <jjmey...@gmail.com>
Subject Re: [DISCUSS] Moving GeoIP management away from MySQL
Date Mon, 16 Jan 2017 20:59:26 GMT
Matt, I agree with your points on why we shouldn't get rid of the database
just for the sake of getting rid of a database. But I think we may be
reinventing the wheel a little by putting the MaxMind data into MySQL at
all. Right now we already download a MaxMind file. To me it seems simpler
to push that file to HDFS, where we can pick it up and have the MaxMind
client read it directly, instead of importing the data into a DB and then
running a query. Also, I believe the data gets updated weekly, so syncing
may become easier too.
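
To make the access pattern concrete: once the range data is available
locally, finding the range that contains an IP is a sorted search rather
than a SQL query. Here is a rough Python sketch of that idea. The ranges
and location names are made up, and this is not how the MaxMind client
works internally (it reads its own binary file format); it only
illustrates the lookup.

```python
import bisect
import ipaddress

# Hypothetical, made-up ranges: (first address, last address, location).
# Real MaxMind data has its own format; this only shows the access pattern.
RANGES = [
    ("10.0.0.0", "10.0.0.255", "Site A"),
    ("192.0.2.0", "192.0.2.127", "Site B"),
    ("198.51.100.0", "198.51.100.255", "Site C"),
]


def build_index(ranges):
    """Convert the ranges to integers and sort them by range start."""
    rows = sorted(
        (int(ipaddress.ip_address(lo)), int(ipaddress.ip_address(hi)), loc)
        for lo, hi, loc in ranges
    )
    starts = [row[0] for row in rows]
    return starts, rows


def lookup(index, ip):
    """Return the location whose range contains ip, or None if none does."""
    starts, rows = index
    key = int(ipaddress.ip_address(ip))
    # Find the last range that starts at or before the address.
    pos = bisect.bisect_right(starts, key) - 1
    if pos < 0:
        return None
    _, hi, loc = rows[pos]
    # The address only matches if it also falls before the range end.
    return loc if key <= hi else None


INDEX = build_index(RANGES)
```

Something like `lookup(INDEX, "192.0.2.5")` would be the per-message call,
and rebuilding `INDEX` from a fresh file would be the weekly refresh.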

James, I believe it works with the paid and free versions of geoip. I know
NiFi uses this client library in their Geo enrichment processor.

Also, if it is decided that using a SQL database is still the best
solution, I think there is a benefit to using their library. We would just
have to implement a `DatabaseProvider` that hits some SQL db instead of
using their standard implementation.
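
As a rough illustration of that second option, here is a Python sketch of
a provider that answers the same kind of lookup from a SQL table. It is
not the Java `DatabaseProvider` interface itself; the class name, table
name, and columns are all hypothetical, and sqlite3 is only a stand-in for
whatever SQL db we would actually use.

```python
import ipaddress
import sqlite3


class SqlGeoProvider:
    """Hypothetical provider that answers city lookups from a SQL table."""

    def __init__(self, conn):
        self.conn = conn

    def city(self, ip):
        """Return the city for ip, or None when no range contains it."""
        key = int(ipaddress.ip_address(ip))
        row = self.conn.execute(
            "SELECT city FROM geo_ranges "
            "WHERE range_start <= ? AND range_end >= ? "
            "ORDER BY range_start DESC LIMIT 1",
            (key, key),
        ).fetchone()
        return row[0] if row else None


def make_demo_connection():
    """Build an in-memory table with made-up ranges for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE geo_ranges "
        "(range_start INTEGER, range_end INTEGER, city TEXT)"
    )
    demo_rows = [
        ("192.0.2.0", "192.0.2.127", "Example City"),
        ("198.51.100.0", "198.51.100.255", "Sample Town"),
    ]
    conn.executemany(
        "INSERT INTO geo_ranges VALUES (?, ?, ?)",
        [
            (int(ipaddress.ip_address(lo)), int(ipaddress.ip_address(hi)), city)
            for lo, hi, city in demo_rows
        ],
    )
    return conn


provider = SqlGeoProvider(make_demo_connection())
```

The point is that callers only see `provider.city(ip)`, so swapping the
file-backed reader for a SQL-backed one would not touch downstream code.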

Thanks,
JJ

On Mon, Jan 16, 2017 at 2:27 PM, James Sirota <jsirota@apache.org> wrote:

> Hi Guys, I just wanted to clarify one point that I think is lost in this
> thread.  Geo enrichment is NOT a key-value enrichment.  It requires a range
> scan and a join (which is why it's implemented via MySQL and not HBase).
> To account for this access pattern via a key-value store you would
> inevitably have to do something funky, or in the case of HBase I don't think
> there is a way to avoid doing a range scan.
>
> With respect to MapDB, it only has support for Maps, Sets, Lists, Queues.
> Are we sure it provides enough functionality for us to do this enrichment?
>
> With respect to the MaxMind client, are we sure we can use it on the
> MySQL-backed version of their DB?  I thought the MaxMind database itself is
> proprietary and is something you have to pay for.  My understanding is that
> the client is designed for that proprietary version.
>
> I somewhat agree with Matt's point.  If MySQL is a problem because of
> licensing, the path of least resistance to remove MySQL dependencies would
> be to simply switch to PostgreSQL.  We will always have conventional SQL
> databases in our stack because other big data tools use them. Why not take
> advantage of them too?
>
> Thanks,
> James
>
> 16.01.2017, 12:27, "Matt Foley" <mattf@apache.org>:
> > Hi Justin, and team,
> > Several components of the Hadoop Stack utilize a SQL database, usually
> > for metadata of some sort. Ambari knows this and arranges for them to
> > share a single database installation (on or off the cluster), unless they
> > explicitly configure use of different databases (which is allowed for
> > sites that desire it). Ambari defaults to using PostgreSQL, although it’s
> > happy to use MySQL, Oracle, or Microsoft, along with whatever each
> > component historically defined as their default (such as Derby).
> >
> > If we want to start with a replacement of current functionality, I would
> > suggest switching the default database to PostgreSQL. Replacing fast,
> > efficient, and proven db services with a file-based api library (but no
> > standard way to propagate the underlying storage files) seems to me to be
> > taking a step backwards.
> >
> > Sticking with a SQL-based service will surely minimize the amount of
> > code changes needed. And making the SQL either dialect-independent or
> > capable of switching among dialects then enables us to do what the rest
> > of the Hadoop stack does: allow enterprise customers to substitute Oracle
> > or Microsoft enterprise-class databases where they wish. Regarding the
> > drivers, we should study what the other Stack components do; I’m not an
> > expert in those areas.
> >
> > Using the same db as the rest of the stack also means administrators can
> > be confident they’ve set up adequate backup and recovery processes.
> > All these are valuable reasons not to roll our own storage system for
> > this enrichment data. IMO, of course.
> >
> > Cheers,
> > --Matt
> >
> > On 1/16/17, 9:52 AM, "Kyle Richardson" <kylerichardson2@gmail.com> wrote:
> >
> >     +1 Agree with David's order
> >
> >     -Kyle
> >
> >     On Mon, Jan 16, 2017 at 12:41 PM, David Lyle <dlyle65535@gmail.com> wrote:
> >
> >     > Def agree on the parity point.
> >     >
> >     > I'm a little worried about Supervisor relocations for non-HBase
> >     > solutions, but having much of the work done for us by MaxMind changes
> >     > my preference to (in order)
> >     >
> >     > 1) MM API
> >     > 2) HBase Enrichment
> >     > 3) MapDB should the others prove not feasible
> >     >
> >     >
> >     > -D...
> >     >
> >     >
> >     > On Mon, Jan 16, 2017 at 12:15 PM, Justin Leet <justinjleet@gmail.com>
> >     > wrote:
> >     >
> >     > > I definitely agree on checking out the MaxMind API. I'll take a look
> >     > > at it, but at first glance it looks like it does include everything
> >     > > we use. Great find, JJ.
> >     > >
> >     > > More details on various people's points:
> >     > >
> >     > > As a note to anyone hopping in, Simon's point on the range lookup vs
> >     > > a key lookup is why it becomes a Scan in HBase vs a Get. As an
> >     > > addendum to what Simon mentioned, denormalizing is easy enough and
> >     > > turns it into an easy range lookup.
> >     > >
> >     > > To David's point, the MapDB does require a network hop, but it's once
> >     > > per refresh of the data (Got a relevant callback? Grab new data, load
> >     > > it, swap out) instead of (up to) once per message. I would expect the
> >     > > same to be true of the MaxMind db files.
> >     > >
> >     > > I'd also argue MapDB is not really more complex than refreshing the
> >     > > HBase table, because we potentially have to start worrying about
> >     > > things like hashing and/or indices and even just general data
> >     > > representation. It's definitely correct that the file processing has
> >     > > to occur on either path, so it really boils down to handling the
> >     > > callback and reloading the file vs handling some of the standard
> >     > > HBasey things. I don't think either is an enormous amount of work
> >     > > (and both are almost certainly more work than MaxMind's API).
> >     > >
> >     > > Regarding extensibility, I'd argue for parity with what we have
> >     > > first, then build what we need from there. Does anybody have any
> >     > > disagreement with that approach for right now?
> >     > >
> >     > > Justin
> >     > >
> >     > > On Mon, Jan 16, 2017 at 12:04 PM, David Lyle <dlyle65535@gmail.com>
> >     > > wrote:
> >     > >
> >     > > > It is interesting: it would save us a ton of effort, and it has the
> >     > > > right license. I think it's worth at least checking out.
> >     > > >
> >     > > > -D...
> >     > > >
> >     > > >
> >     > > > On Mon, Jan 16, 2017 at 12:00 PM, Simon Elliston Ball <
> >     > > > simon@simonellistonball.com> wrote:
> >     > > >
> >     > > > > I like that approach even more. That way we would only have to
> >     > > > > worry about distributing the database file in binary format to
> >     > > > > all the supervisor nodes on update.
> >     > > > >
> >     > > > > It would also make it easier for people to switch to the
> >     > > > > enterprise DB potentially if they had the license.
> >     > > > >
> >     > > > > One slight issue with this might be for people who wanted to
> >     > > > > extend the database. For example, organisations may want to add
> >     > > > > geo-enrichment to their own private network addresses based on
> >     > > > > modified versions of the geo database. Currently we don’t really
> >     > > > > allow this, since we hard-code ignoring private network classes
> >     > > > > into the geo enrichment adapter, but I can see a case where a
> >     > > > > global org might want to add their own ranges and locations to
> >     > > > > the data set. Does that make sense to anyone else?
> >     > > > >
> >     > > > > Simon
> >     > > > >
> >     > > > >
> >     > > > > > On 16 Jan 2017, at 16:50, JJ Meyer <jjmeyer0@gmail.com> wrote:
> >     > > > > >
> >     > > > > > Hello all,
> >     > > > > >
> >     > > > > > Can we leverage MaxMind's Java client
> >     > > > > > (https://github.com/maxmind/GeoIP2-java/tree/master/src/main/java/com/maxmind/geoip2)
> >     > > > > > in this case? I believe it can directly read the MaxMind file.
> >     > > > > > Plus, I think it also has some support for caching.
> >     > > > > >
> >     > > > > > Thanks,
> >     > > > > > JJ
> >     > > > > >
> >     > > > > > On Mon, Jan 16, 2017 at 10:32 AM, Simon Elliston Ball
> >     > > > > > <simon@simonellistonball.com> wrote:
> >     > > > > >
> >     > > > > >> I like the idea of MapDB, since we can essentially pull an
> >     > > > > >> instance into each supervisor, so it makes a lot of sense for
> >     > > > > >> relatively small scale, relatively static enrichments in
> >     > > > > >> general.
> >     > > > > >>
> >     > > > > >> Generally this feels like a caching problem, and would be for
> >     > > > > >> a simple key-value lookup. In that case I would agree with
> >     > > > > >> David Lyle on using HBase as a source of truth and relying on
> >     > > > > >> caching.
> >     > > > > >>
> >     > > > > >> That said, GeoIP is a different lookup pattern, since it’s a
> >     > > > > >> range lookup then a key lookup (or if we denormalize the
> >     > > > > >> MaxMind data, just a range lookup). For that kind of thing,
> >     > > > > >> MapDB with something like the BTree seems a good fit.
> >     > > > > >>
> >     > > > > >> Simon
> >     > > > > >>
> >     > > > > >>
> >     > > > > >>> On 16 Jan 2017, at 16:28, David Lyle <dlyle65535@gmail.com>
> >     > > > > >>> wrote:
> >     > > > > >>>
> >     > > > > >>> I'm +1 on removing the MySQL dependency, BUT - I'd prefer to
> >     > > > > >>> see it as an HBase enrichment. If our current caching isn't
> >     > > > > >>> enough to mitigate the above issues, we have a problem, don't
> >     > > > > >>> we? Or do we not recommend HBase enrichment for per message
> >     > > > > >>> enrichment in general?
> >     > > > > >>>
> >     > > > > >>> Also- can you elaborate on how MapDB would not require a
> >     > > > > >>> network hop? Doesn't this mean we would have to sync the
> >     > > > > >>> enrichment data to each Storm supervisor? HDFS could
> >     > > > > >>> (probably would) have a network hop too, no?
> >     > > > > >>> Fwiw -
> >     > > > > >>> "In its place, I've looked at using MapDB, which is a really
> >     > > > > >>> easy to use library for creating Java collections backed by a
> >     > > > > >>> file (This is NOT a separate installation of anything, it's
> >     > > > > >>> just a jar that manages interaction with the file system).
> >     > > > > >>> Given the slow churn of the GeoIP files (I believe they get
> >     > > > > >>> updated once a week), we can have a script that can be run
> >     > > > > >>> when needed, downloads the MaxMind tar file, builds the MapDB
> >     > > > > >>> file that will be used by the bolts, and places it into HDFS.
> >     > > > > >>> Finally, we update a config to point to the new file, the
> >     > > > > >>> bolts get the updated config callback and can update their db
> >     > > > > >>> files. Inside the code, we wrap the MapDB portions to make it
> >     > > > > >>> transparent to downstream code."
> >     > > > > >>>
> >     > > > > >>> Seems a bit more complex than "refresh the hbase table".
> >     > > > > >>> Afaik, either approach would require some sort of translation
> >     > > > > >>> between GeoIP source format and target format, so that part
> >     > > > > >>> is a wash imo.
> >     > > > > >>>
> >     > > > > >>> So, I'd really like to see, at least, an attempt to leverage
> >     > > > > >>> HBase enrichment.
> >     > > > > >>>
> >     > > > > >>> -D...
> >     > > > > >>>
> >     > > > > >>>
> >     > > > > >>> On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella
> >     > > > > >>> <cestella@gmail.com> wrote:
> >     > > > > >>>
> >     > > > > >>>> I think that it's a sensible thing to use MapDB for the geo
> >     > > > > >>>> enrichment. Let me state my reasoning:
> >     > > > > >>>>
> >     > > > > >>>> - An HBase implementation would necessitate a HBase scan
> >     > > > > >>>>   possibly hitting HDFS, which is expensive per-message.
> >     > > > > >>>> - An HBase implementation would necessitate a network hop
> >     > > > > >>>>   and MapDB would not.
> >     > > > > >>>>
> >     > > > > >>>> I also think this might be the beginning of a more general
> >     > > > > >>>> purpose support in Stellar for locally shipped, read-only
> >     > > > > >>>> MapDB lookups, which might be interesting.
> >     > > > > >>>>
> >     > > > > >>>> In short, all quotes about premature optimization are sure
> >     > > > > >>>> to apply to my reasoning, but I can't help but have my
> >     > > > > >>>> spidey senses tingle when we introduce a scan-per-message
> >     > > > > >>>> architecture.
> >     > > > > >>>>
> >     > > > > >>>> Casey
> >     > > > > >>>>
> >     > > > > >>>> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov
> >     > > > > >>>> <Dima.Kovalyov@sstech.us> wrote:
> >     > > > > >>>>
> >     > > > > >>>>> Hello Justin,
> >     > > > > >>>>>
> >     > > > > >>>>> Considering that Metron uses HBase tables for storing
> >     > > > > >>>>> enrichment and threatintel feeds, can we use HBase for geo
> >     > > > > >>>>> enrichment as well? Or can MapDB be used for enrichment and
> >     > > > > >>>>> threatintel feeds instead of HBase?
> >     > > > > >>>>>
> >     > > > > >>>>> - Dima
> >     > > > > >>>>>
> >     > > > > >>>>> On 01/16/2017 04:17 PM, Justin Leet wrote:
> >     > > > > >>>>>> Hi all,
> >     > > > > >>>>>>
> >     > > > > >>>>>> As a bit of background, right now, GeoIP data is loaded
> >     > > > > >>>>>> into and managed by MySQL (the connectors are LGPL
> >     > > > > >>>>>> licensed and we need to sever our Maven dependency on it
> >     > > > > >>>>>> before next release). We currently depend on and install
> >     > > > > >>>>>> an instance of MySQL (in each of the Management Pack,
> >     > > > > >>>>>> Ansible, and Docker installs). In the topology, we use the
> >     > > > > >>>>>> JDBCAdapter to connect to MySQL and query for a given IP.
> >     > > > > >>>>>> Additionally, it's a single point of failure for that
> >     > > > > >>>>>> particular enrichment right now. If MySQL is down, geo
> >     > > > > >>>>>> enrichment can't occur.
> >     > > > > >>>>>>
> >     > > > > >>>>>> I'm proposing that we eliminate the use of MySQL entirely,
> >     > > > > >>>>>> through all installation paths (which, unless I missed
> >     > > > > >>>>>> some, includes Ansible, the Ambari Management Pack, and
> >     > > > > >>>>>> Docker). We'd do this by dropping all the various MySQL
> >     > > > > >>>>>> setup and management through the code, along with all the
> >     > > > > >>>>>> DDL, etc. The JDBCAdapter would stay, so that anybody who
> >     > > > > >>>>>> wants to set up their own databases for enrichments and
> >     > > > > >>>>>> install connectors is able to do so.
> >     > > > > >>>>>>
> >     > > > > >>>>>> In its place, I've looked at using MapDB, which is a
> >     > > > > >>>>>> really easy to use library for creating Java collections
> >     > > > > >>>>>> backed by a file (This is NOT a separate installation of
> >     > > > > >>>>>> anything, it's just a jar that manages interaction with
> >     > > > > >>>>>> the file system). Given the slow churn of the GeoIP files
> >     > > > > >>>>>> (I believe they get updated once a week), we can have a
> >     > > > > >>>>>> script that can be run when needed, downloads the MaxMind
> >     > > > > >>>>>> tar file, builds the MapDB file that will be used by the
> >     > > > > >>>>>> bolts, and places it into HDFS. Finally, we update a
> >     > > > > >>>>>> config to point to the new file, the bolts get the updated
> >     > > > > >>>>>> config callback and can update their db files. Inside the
> >     > > > > >>>>>> code, we wrap the MapDB portions to make it transparent to
> >     > > > > >>>>>> downstream code.
> >     > > > > >>>>>>
> >     > > > > >>>>>> The particularly nice parts about using MapDB are its ease
> >     > > > > >>>>>> of use, plus it offers the utilities we need out of the
> >     > > > > >>>>>> box to support the operations we need on this (keep in
> >     > > > > >>>>>> mind the GeoIP files use IP ranges and we need to be able
> >     > > > > >>>>>> to easily grab the appropriate range).
> >     > > > > >>>>>>
> >     > > > > >>>>>> The main point of concern I have about this is that when
> >     > > > > >>>>>> we grab the HDFS file during an update, given that
> >     > > > > >>>>>> multiple JVMs can be running, we don't want them to
> >     > > > > >>>>>> clobber each other. I believe this can be avoided by
> >     > > > > >>>>>> simply using each worker's working directory to store the
> >     > > > > >>>>>> file (and appropriately ensure threads on the same JVM
> >     > > > > >>>>>> manage multithreading). This should keep the JVMs (and the
> >     > > > > >>>>>> underlying DB files) entirely independent.
> >     > > > > >>>>>>
> >     > > > > >>>>>> This script would get called by the various installations
> >     > > > > >>>>>> during startup to do the initial setup. After install, it
> >     > > > > >>>>>> can then be called on demand in order.
> >     > > > > >>>>>>
> >     > > > > >>>>>> At this point, we should be all set, with everything
> >     > > > > >>>>>> running and updatable.
> >     > > > > >>>>>>
> >     > > > > >>>>>> Justin
>
> -------------------
> Thank you,
>
> James Sirota
> PPMC- Apache Metron (Incubating)
> jsirota AT apache DOT org
>
